Title: Huffman coding
1. Huffman coding
2. Optimal codes - I
- A code is optimal if it has the shortest expected codeword length L
- This can be seen as an optimization problem
3. Optimal codes - II
- Let's make two simplifying assumptions:
- no integer constraint on the codeword lengths
- the Kraft inequality holds with equality
- Minimizing L under these assumptions is a Lagrange-multiplier problem (a sketch of the derivation follows the next slide)
4. Optimal codes - III
- Substituting the solution back into the Kraft inequality gives the optimal codeword lengths, that is l*_i = -log_D p_i
- Note that the resulting expected length is L* = sum_i p_i l*_i = H_D(X), the entropy when we use base D for logarithms
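The slides omit the intermediate algebra; the following is a compact LaTeX sketch of the Lagrange-multiplier step, using the usual (assumed) notation: p_i for the source probabilities, l_i for the codeword lengths, D for the code alphabet size and lambda for the multiplier.

% Sketch of the Lagrange-multiplier derivation (notation assumed, not from the slides).
\[
  J = \sum_i p_i l_i + \lambda \Big( \sum_i D^{-l_i} - 1 \Big),
  \qquad
  \frac{\partial J}{\partial l_i} = p_i - \lambda\, D^{-l_i} \ln D = 0
  \;\Rightarrow\;
  D^{-l_i} = \frac{p_i}{\lambda \ln D}.
\]
\[
  \text{Substituting into } \sum_i D^{-l_i} = 1 \text{ gives } \lambda = \tfrac{1}{\ln D},
  \text{ hence } p_i = D^{-l_i},\quad
  l_i^{*} = -\log_D p_i,\quad
  L^{*} = \sum_i p_i l_i^{*} = H_D(X).
\]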
5. Optimal codes - IV
- In practice the codeword lengths must be integer values, so the result above is only a lower bound
- Theorem: the expected length of any instantaneous D-ary code for a r.v. X satisfies L >= H_D(X)
- This fundamental result derives from the work of Shannon
6. Optimal codes - V
- What about the upper bound?
- Theorem: given a source alphabet (i.e. a r.v.) of entropy H(X), it is possible to find an instantaneous binary code whose length satisfies H(X) <= L < H(X) + 1
- A similar theorem can be stated if we use the wrong probabilities instead of the true ones; the only difference is a term which accounts for the relative entropy (see the statement below)
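For completeness, this is the textbook statement the last bullet alludes to, written in LaTeX; p is the true distribution, q the (wrong) distribution used to design the code, and D(p||q) the relative entropy. The slide text does not spell out the formula, so treat the exact constants as the standard ones rather than the author's.

% Shannon-code lengths designed for q, evaluated under the true distribution p:
\[
  H(p) + D(p\|q) \;\le\; \mathbb{E}_p\!\left[\, l(X) \,\right] \;<\; H(p) + D(p\|q) + 1 .
\]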
7. The redundancy
- It is defined as the average codeword length minus the entropy: R = L - H(X)
- Note that 0 <= R < 1 for an optimal code (why?)
8. Compression ratio
- It is the ratio between the average number of bits/symbol in the original message and the same quantity for the coded message, i.e. C = (average bits/symbol of the original message) / (average bits/symbol of the coded message)
9. Uniquely decodable codes
- The set of instantaneous codes is a small subset of the uniquely decodable codes.
- Is it possible to obtain a lower average code length L using a uniquely decodable code that is not instantaneous? NO
- So we use instantaneous codes, which are easier to decode
10. Summary
- Average codeword length: L >= H_D(X) for uniquely decodable codes (and for instantaneous codes)
- In practice, for each r.v. with entropy H(X) we can build a code with average codeword length that satisfies H(X) <= L < H(X) + 1
11. Shannon-Fano coding
- The main advantage of the Shannon-Fano technique is its simplicity (a sketch follows the list)
- Source symbols are listed in order of non-increasing probability.
- The list is divided in such a way as to form two groups of as nearly equal probabilities as possible
- Each symbol in the first group receives a 0 as the first digit of its codeword, while the others receive a 1
- Each of these groups is then divided according to the same criterion, and additional code digits are appended
- The process is continued until each group contains only one message
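A minimal Python sketch of the procedure just described; the function name shannon_fano and the example probabilities are illustrative, not taken from the slides.

# Minimal Shannon-Fano sketch (hypothetical helper, not from the slides).
def shannon_fano(probs):
    """probs: dict {symbol: probability}. Returns {symbol: codeword}."""
    # 1) list symbols in order of non-increasing probability
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    codes = {sym: "" for sym, _ in items}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(p for _, p in group)
        # 2) find the split that makes the two halves as nearly equal as possible
        best_i, best_diff, acc = 1, float("inf"), 0.0
        for i in range(1, len(group)):
            acc += group[i - 1][1]
            diff = abs(acc - (total - acc))
            if diff < best_diff:
                best_i, best_diff = i, diff
        first, second = group[:best_i], group[best_i:]
        # 3) first group gets a 0, second group gets a 1
        for sym, _ in first:
            codes[sym] += "0"
        for sym, _ in second:
            codes[sym] += "1"
        # 4) repeat on each group until it contains a single symbol
        split(first)
        split(second)

    split(items)
    return codes

# Example (assumed probabilities):
# shannon_fano({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.125, "e": 0.125})
# -> {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}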
12. Example
- H = 1.9375 bits
- L = 1.9375 bits
13. Shannon-Fano coding - exercise
- Encode, using the Shannon-Fano algorithm
14. Is Shannon-Fano coding optimal?
- L1 = 2.3 bits
15. Huffman coding - I
- There is another algorithm whose performance is slightly better than Shannon-Fano's: the famous Huffman coding
- It works by constructing a tree bottom-up, with the symbols in the leaves
- The two leaves with the smallest probabilities become siblings under a parent node whose probability is the sum of the two children's probabilities
16. Huffman coding - II
- The operation is then repeated, considering also the new parent node and ignoring its children
- The process continues until only one parent node, with probability 1, is left: it is the root of the tree
- Then the two branches of every non-leaf node are labeled 0 and 1 (typically 0 on the left branch, but the order is not important); a small sketch of the construction follows
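A minimal Python sketch of the construction described on the last two slides; the function name huffman_codes is illustrative, and the demo probabilities are those of the example on the next slide.

# Minimal Huffman-construction sketch (hypothetical helper, not from the slides).
import heapq
from itertools import count

def huffman_codes(probs):
    """probs: dict {symbol: probability}. Returns {symbol: codeword}."""
    tiebreak = count()  # keeps heap entries comparable when probabilities are equal
    # Each heap entry is (probability, tiebreak, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # merge the two smallest-probability nodes under a new parent whose
        # probability is the sum of the children's probabilities
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))

    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: label branches 0 and 1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                             # leaf: record the accumulated codeword
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

# Demo with the probabilities of the following example slide:
# huffman_codes({"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1})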
17. Huffman coding - example
- [Huffman tree for the source a 0.05, b 0.05, c 0.1, d 0.2, e 0.3, f 0.2, g 0.1; the intermediate nodes have probabilities 0.1, 0.2, 0.3, 0.4, 0.6 and 1.0, and each pair of branches is labeled 0 and 1]
18. Huffman coding - example
- Exercise: evaluate H(X) and L(X) (a numeric check follows)
- H(X) = 2.5464 bits
- L(X) = 2.6 bits !!
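A quick numeric check of these two values. The probabilities are those of the example; the codeword lengths below follow the code used in the next exercise's solution (a = 0000, b = 0001, c = 001, d = 01, e = 10, f = 110, g = 111).

from math import log2

probs   = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}
lengths = {"a": 4, "b": 4, "c": 3, "d": 2, "e": 2, "f": 3, "g": 3}

H = sum(p * log2(1 / p) for p in probs.values())    # entropy
L = sum(probs[s] * lengths[s] for s in probs)       # average codeword length
print(f"H(X) = {H:.4f} bits, L(X) = {L:.1f} bits")  # 2.5464 and 2.6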
19. Huffman coding - exercise
- Code the sequence aeebcddegfced and calculate the compression ratio
- Sol: 0000 10 10 0001 001 01 01 10 111 110 001 10 01
- Average original symbol length: 3 bits
- Average compressed symbol length: 34/13 bits
- C = .....
20. Huffman coding - exercise
- Decode the sequence 0111001001000001111110
- Sol: dfdcadgf (a small decoding sketch follows)
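A minimal prefix-code decoder in Python, just to make the mechanics of these exercises concrete. The code table is the one inferred from the previous exercise's solution; the function name decode is illustrative.

# Greedy prefix-code decoder (hypothetical helper, not from the slides).
CODE = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}
REVERSE = {cw: sym for sym, cw in CODE.items()}

def decode(bits):
    """Decode a bit string produced with CODE; works because the code is prefix-free."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in REVERSE:       # a complete codeword has been read
            out.append(REVERSE[buf])
            buf = ""
    return "".join(out)

print(decode("0111001001000001111110"))   # -> dfdcadgf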
21. Huffman coding - exercise
- Encode with Huffman the sequence 01cc0a02ba10 and evaluate the entropy, the average codeword length and the compression ratio
22. Huffman coding - exercise
- Decode (if possible) the Huffman-coded bit stream 01001011010011110101...
23. Huffman coding - notes
- In Huffman coding, if at any time there is more than one way to choose a smallest pair of probabilities, any such pair may be chosen
- Sometimes the list of probabilities is initialized to be non-increasing and reordered after each node creation. This detail doesn't affect the correctness of the algorithm, but it provides a more efficient implementation
24. Huffman coding - notes
- There are cases in which Huffman coding does not uniquely determine the codeword lengths, due to the arbitrary choice among equal minimum probabilities.
- For example, for a source with probabilities ... it is possible to obtain codeword lengths of ... and of ...
- It would be better to have a code whose codeword lengths have minimum variance, as this solution needs the minimum buffer space in the transmitter and in the receiver
25. Huffman coding - notes
- Schwarz defines a variant of the Huffman algorithm that allows one to build the code with minimum ...
- There are several other variants; we will explain the most important ones in a while.
26. Optimality of Huffman coding - I
- It is possible to prove that, in the case of character coding (one symbol, one codeword), Huffman coding is optimal
- In other terms, the Huffman code has minimum redundancy
- An upper bound for the redundancy has been found: R <= ..., where ... is the probability of the most likely symbol
27. Optimality of Huffman coding - II
- Why does the Huffman code suffer when there is one symbol with very high probability?
- Remember the notion of uncertainty... a symbol with probability 0.99 carries only about 0.0145 bits of information, yet it cannot receive a codeword shorter than 1 bit
- The main problem is the integer constraint on codeword lengths!!
- This consideration opens the way to a more powerful coding technique... we will see it later
28. Huffman coding - implementation
- A Huffman code can be generated in O(n) time, where n is the number of source symbols, provided that the probabilities have been presorted (however, this sort costs O(n log n)...); see the sketch below
- Nevertheless, encoding is very fast
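A sketch of how the linear-time construction can work once the probabilities are sorted, using the classic two-queue idea: leaves wait in one FIFO queue, merged nodes in another, so the smallest remaining node is always at the front of one of the two. The function name and demo values are illustrative, not from the slides.

# Two-queue Huffman construction for presorted probabilities (sketch).
from collections import deque

def huffman_lengths_presorted(sorted_probs):
    """sorted_probs: probabilities in non-decreasing order.
    Returns the Huffman codeword length of each symbol, in the same order."""
    n = len(sorted_probs)
    if n == 1:
        return [1]
    # Node ids: 0..n-1 are leaves, n..2n-2 are internal nodes.
    leaves = deque((p, i) for i, p in enumerate(sorted_probs))
    merged = deque()
    children = {}                         # internal node id -> (left id, right id)

    def pop_smallest():
        # Both queues stay sorted, so the minimum sits at the front of one of them.
        if not merged or (leaves and leaves[0][0] <= merged[0][0]):
            return leaves.popleft()
        return merged.popleft()

    for new_id in range(n, 2 * n - 1):    # exactly n-1 merges
        p1, id1 = pop_smallest()
        p2, id2 = pop_smallest()
        children[new_id] = (id1, id2)
        merged.append((p1 + p2, new_id))

    # Depths top-down: the root (the last node created) has depth 0.
    depth = [0] * (2 * n - 1)
    for node in range(2 * n - 2, n - 1, -1):
        for child in children[node]:
            depth[child] = depth[node] + 1
    return depth[:n]

# Demo with the probabilities of the earlier example, sorted:
# huffman_lengths_presorted([0.05, 0.05, 0.1, 0.1, 0.2, 0.2, 0.3]) -> [4, 4, 3, 3, 3, 2, 2]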
29. Huffman coding - implementation
- However, the spatial and temporal complexity of the decoding phase is far more important, because, on average, decoding happens more frequently.
- Consider a Huffman tree with n symbols:
- n leaves and n-1 internal nodes
- each leaf holds a pointer to its symbol plus the information that it is a leaf
- each internal node holds two pointers (to its children)
30. Huffman coding - implementation
- 1 million symbols → 16 MB of memory!
- Moreover, traversing the tree from root to leaf involves following a lot of pointers, with little locality of reference. This causes many page faults or cache misses.
- To solve this problem a variant of Huffman coding has been proposed: canonical Huffman coding
31. canonical Huffman coding - I
- [Code tree for the source a 0.11, b 0.12, c 0.13, d 0.14, e 0.24, f 0.26, with internal node probabilities 0.23, 0.27, 0.47, 0.53 and 1.0; each branch carries a 0/1 label, with an alternative label shown in parentheses]
32. canonical Huffman coding - II
- This code cannot be obtained through a Huffman tree!
- We still call it a Huffman code because it is instantaneous and its codeword lengths are the same as those of a valid Huffman code
- Numerical sequence property:
- codewords with the same length are ordered lexicographically
- when the codewords are sorted in lexical order, they are also in order from the longest to the shortest codeword
33. canonical Huffman coding - III
- The main advantage is that it is not necessary to store a tree in order to decode
- We only need:
- a list of the symbols, ordered according to the lexical order of the codewords
- an array with the first codeword of each distinct length
34. canonical Huffman coding - IV
- Encoding. Suppose there are n distinct symbols and that for symbol i we have calculated the Huffman codelength l_i; we then use the following arrays (a sketch of the construction follows):
- numl[k]: number of codewords with length k
- firstcode[k]: integer for the first code of length k
- nextcode[k]: integer for the next codeword of length k to be assigned
- symbol[-,-]: used for decoding
- codeword[i]: the rightmost l_i bits of this integer are the code for symbol i
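A Python sketch of one common way to fill these arrays from the codeword lengths. The convention used here (longest codes starting at 0, and firstcode[k] = (firstcode[k+1] + numl[k+1]) / 2) is the one the example on the next slides appears to follow, but the exact formula is not spelled out in the slide text, so treat it as an assumption.

# Canonical Huffman encoding tables from codeword lengths (sketch).
def canonical_tables(lengths):
    """lengths: dict {symbol: codeword length}.
    Returns (numl, firstcode, codeword, symbol)."""
    maxlen = max(lengths.values())
    numl = {k: 0 for k in range(1, maxlen + 1)}
    for l in lengths.values():
        numl[l] += 1

    # First code of each length, from the longest length up to length 1.
    firstcode = {maxlen: 0}
    for k in range(maxlen - 1, 0, -1):
        firstcode[k] = (firstcode[k + 1] + numl[k + 1]) // 2

    nextcode = dict(firstcode)            # next integer to assign for each length
    codeword, symbol = {}, {}
    for s in sorted(lengths):             # same-length codewords end up in lexical order
        k = lengths[s]
        codeword[s] = format(nextcode[k], f"0{k}b")   # the rightmost k bits
        symbol[(k, nextcode[k] - firstcode[k])] = s   # lookup table for decoding
        nextcode[k] += 1
    return numl, firstcode, codeword, symbol

# Demo with the lengths of the example on the next slide
# (a, e, h -> 2 bits; d -> 3 bits; b, c, f, g -> 5 bits):
# canonical_tables({"a": 2, "e": 2, "h": 2, "d": 3, "b": 5, "c": 5, "f": 5, "g": 5})
# gives firstcode = {5: 0, 4: 2, 3: 1, 2: 1, 1: 2} and codewords
# a=01, e=10, h=11, d=001, b=00000, c=00001, f=00010, g=00011.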
35. canonical Huffman - example
- 2. Evaluate array firstcode
- 3. Construct arrays codeword and symbol
- symbol[k, n] table (rows k = 1..5, columns n = 0..3):
  k=1:  -  -  -  -
  k=2:  a  e  h  -
  k=3:  d  -  -  -
  k=4:  -  -  -  -
  k=5:  b  c  f  g
36. canonical Huffman coding - V
- Decoding. We have the arrays firstcode and symbol, plus (a sketch follows):
- nextinputbit(): function that returns the next input bit
- firstcode[k]: integer for the first code of length k
- symbol[k, n]: returns symbol number n among those with codelength k
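A Python sketch of this decoding loop, under the same assumptions as the encoding sketch above; nextinputbit is modeled by iterating over a bit string, and the firstcode and symbol tables are the ones computed there.

# Canonical Huffman decoding with firstcode and symbol tables (sketch).
def canonical_decode(bits, firstcode, symbol):
    """bits: string of '0'/'1'. Decodes symbols until the input is exhausted."""
    out = []
    it = iter(bits)

    def nextinputbit():
        return int(next(it))

    try:
        while True:
            v, k = nextinputbit(), 1
            # Extend the current value until it reaches the first code of its length.
            while v < firstcode[k]:
                v = 2 * v + nextinputbit()
                k += 1
            out.append(symbol[(k, v - firstcode[k])])
    except StopIteration:
        return "".join(out)

# Demo (tables from the encoding sketch): decoding "01" gives a,
# "001" gives d, "00000" gives b, and so on.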
37. canonical Huffman - example
- [Worked decoding trace: input bits are read one at a time and, using the firstcode values and the symbol[k, n] table of the previous example, the lookups symbol[2,0] = a, symbol[3,0] = d, symbol[2,1] = e, symbol[2,2] = h and symbol[5,0] = b recover the decoded symbols]