Title: Data Compression Basics
1 Data Compression Basics: Huffman Coding
- Motivation for Data Compression.
- Lossless and Lossy Compression Techniques.
- Static Lossless Compression: Huffman Coding.
- Correctness of Huffman Coding: the prefix property.
2 Why Data Compression?
- Data storage and transmission cost money. This cost increases with the amount of data available.
- This cost can be reduced by processing the data so that it takes less memory and less transmission time.
- Data transmission can be made faster by using better transmission media or by compressing the data.
- Data compression algorithms reduce the size of given data without affecting its content.
- Examples:
  - Huffman coding
  - Run-length coding
  - Lempel-Ziv coding
3 Lossless and Lossy Compression Techniques
- Data compression techniques are broadly classified into lossless and lossy.
- Lossless techniques enable exact reconstruction of the original document from the compressed information, while lossy techniques do not.
- Run-length, Huffman and Lempel-Ziv are lossless, while JPEG and MPEG are lossy techniques.
- Lossy techniques usually achieve higher compression ratios than lossless ones, but lossless techniques are more accurate because no information is lost.
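To make "exact reconstruction" concrete, here is a minimal sketch of run-length coding, one of the lossless techniques named above. The function names `rle_encode` and `rle_decode` and the sample string are illustrative assumptions, not part of the slides.

```python
from itertools import groupby

def rle_encode(text):
    """Replace each run of a repeated character by a (character, run length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Expand (character, run length) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)

original = "aaaabbbcca"          # hypothetical sample input
encoded = rle_encode(original)
print(encoded)                   # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
print(rle_decode(encoded) == original)  # True: nothing was lost
```

Decoding returns exactly the input string, which is what distinguishes a lossless scheme from a lossy one.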
4 Lossless and Lossy Compression Techniques (cont'd)
- Lempel-Ziv reads variable-sized input and outputs fixed-length bit codes, while Huffman coding is the exact opposite: it reads fixed-size input symbols and outputs variable-length codes.
- Lossless techniques are classified into static and adaptive.
- In a static scheme, like Huffman coding, the data is first scanned to obtain statistical information before compression begins.
- Adaptive models like Lempel-Ziv begin with an initial statistical distribution of the text symbols but modify this distribution as each character or word is encoded.
- Adaptive schemes fit the text more closely, but static schemes involve fewer computations and are faster.
5 Introduction to Huffman Coding
- What is the likelihood that all symbols in a message to be transmitted have the same number of occurrences?
- Huffman coding assigns codewords of different lengths to characters based on their frequency of occurrence in the given message: the more frequent a character, the shorter its codeword.
- The string to be transmitted is first analysed to find the relative frequencies of its constituent characters.
- The coding process generates a binary tree, the Huffman code tree, with branches labeled with bits (0 and 1).
- The Huffman tree must be sent with the compressed information to enable the receiver to decode the message.
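The frequency analysis step can be sketched in a few lines. The message `"tennessee"` is a hypothetical input chosen for illustration, and the variable names are assumptions.

```python
from collections import Counter

message = "tennessee"            # hypothetical message, not from the slides
freqs = Counter(message)         # absolute number of occurrences per character
total = sum(freqs.values())

# Relative frequency of each character, most frequent first.
for ch, count in freqs.most_common():
    print(ch, count, round(count / total, 3))
```

Here `e` occurs 4 times out of 9 while `t` occurs once, so a Huffman code would give `e` a shorter codeword than `t`.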
6 Example 1: Huffman Coding
- Example 1: Information to be transmitted over the internet contains the following characters with their associated frequencies, as shown in the following table:

Character   a    e    l    n    o    s    t
Frequency   45   65   13   45   18   22   53

- Use the Huffman technique to answer the following questions:
  - Build the Huffman code tree for the message.
  - Use the Huffman tree to find the codeword for each character.
  - If the data consists of only these characters, what is the total number of bits to be transmitted? What is the percentage saving compared with sending the data as 8-bit ASCII values without compression?
  - Verify that your computed Huffman codewords are correct.
7 Example 1: Huffman Coding (Solution)
- Solution: The Huffman coding process uses a priority queue of binary trees ordered by frequency.
- We begin by filling the priority queue with one-node binary trees, each containing a frequency count and the symbol with that frequency.
- The initial priority queue is built by arranging the one-node binary trees in increasing order of frequency.
- The tree with the lowest frequency (its priority) is designated as the front of the queue.
- At each step, the priority queue is manipulated as outlined next.
8 Example 1: Huffman Coding (Solution)
- The priority queue is manipulated as follows:
  - 1. Dequeue two trees from the front of the queue.
  - 2. Construct a new binary tree from the two trees as follows:
    - a. Use the two dequeued trees as the left and right subtrees of the new tree.
    - b. Give the new tree the priority that is the sum of the priorities of its left and right subtrees.
  - 3. Enqueue the new tree using as its priority the sum of the priorities of the two trees used to construct it.
  - 4. Continue this process until only one tree is in the priority queue.
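The four steps can be sketched with Python's `heapq` module standing in for the priority queue. This is a minimal sketch, not the slides' own implementation: the function name `build_huffman_tree`, the nested-tuple tree representation, and the tie-breaking rule (a newly merged tree is dequeued ahead of an existing tree with the same priority) are assumptions, the last one chosen so that ties resolve as in Example 1.

```python
import heapq

# Frequencies from Example 1, listed in increasing order of frequency.
FREQS = {"l": 13, "o": 18, "s": 22, "n": 45, "a": 45, "t": 53, "e": 65}

def build_huffman_tree(freqs):
    """Return (tree, merge_weights); a tree is a character or a (left, right) tuple."""
    ordered = sorted(freqs.items(), key=lambda item: item[1])
    # Heap entries are (priority, tie_breaker, tree); tie_breaker is unique,
    # so two trees are never compared with each other directly.
    heap = [(freq, rank, sym) for rank, (sym, freq) in enumerate(ordered)]
    heapq.heapify(heap)
    merge_weights = []
    next_rank = -1  # assumption: merged trees win ties against older entries
    while len(heap) > 1:                           # step 4: stop at one tree
        w1, _, left = heapq.heappop(heap)          # step 1: dequeue two trees
        w2, _, right = heapq.heappop(heap)
        merge_weights.append(w1 + w2)              # step 2b: summed priority
        heapq.heappush(heap, (w1 + w2, next_rank, (left, right)))  # steps 2a, 3
        next_rank -= 1
    return heap[0][2], merge_weights

tree, merge_weights = build_huffman_tree(FREQS)
print(merge_weights)  # [31, 53, 90, 106, 155, 261]
print(tree)
```

The printed weights are the priorities of the trees created at each step; the last one, 261, is the total number of characters in the message. The sketch assumes at least one symbol is supplied.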
9 Example 1: Huffman Coding Step 1
- Initial priority queue, front first:
  - l:13, o:18, s:22, n:45, a:45, t:53, e:65
10 Example 1: Solution (cont'd)
- Priority queue after merging l and o into a new tree of priority 31, front first:
  - s:22, 31(l, o), n:45, a:45, t:53, e:65
11 Example 1: Solution (cont'd)
- Priority queue after merging s and the tree 31 into a new tree of priority 53, front first:
  - n:45, a:45, 53(s, 31(l, o)), t:53, e:65
12 Example 1: Solution (cont'd)
- Priority queue after merging n and a into a new tree of priority 90, front first:
  - 53(s, 31(l, o)), t:53, e:65, 90(n, a)
13 Example 1: Solution (cont'd)
- Priority queue after merging the tree 53 and t into a new tree of priority 106, front first:
  - e:65, 90(n, a), 106(53(s, 31(l, o)), t)
14 Example 1: Solution (cont'd)
- Priority queue after merging e and the tree 90 into a new tree of priority 155, front first:
  - 106(53(s, 31(l, o)), t), 155(e, 90(n, a))
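15 Example 1: Solution (cont'd)
- The queue now holds a single tree of priority 261, the Huffman code tree (left subtree listed first):
  - 261
    - 106
      - 53
        - s
        - 31
          - l
          - o
      - t
    - 155
      - e
      - 90
        - n
        - a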
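16 Example 1: Solution (cont'd)
- The Huffman code tree with its branches labeled: each left branch is labeled 0 and each right branch is labeled 1.
  - 261
    - 0: 106
      - 0: 53
        - 0: s
        - 1: 31
          - 0: l
          - 1: o
      - 1: t
    - 1: 155
      - 0: e
      - 1: 90
        - 0: n
        - 1: a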
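Reading the codewords off the labeled tree amounts to collecting the branch labels on each root-to-leaf path. The sketch below hard-codes the tree of Example 1 as nested `(left, right)` tuples; that representation, the name `codewords`, and the convention that left branches carry 0 and right branches carry 1 are assumptions made for illustration.

```python
# Huffman code tree of Example 1: a leaf is a character, an internal node
# is a (left, right) tuple. Assumption: left branch = 0, right branch = 1.
TREE = ((("s", ("l", "o")), "t"), ("e", ("n", "a")))

def codewords(tree, path=""):
    """Map every leaf character to the bit string on its root-to-leaf path."""
    if isinstance(tree, str):          # reached a terminal node
        return {tree: path}
    left, right = tree
    return {**codewords(left, path + "0"), **codewords(right, path + "1")}

CODES = codewords(TREE)
for ch in sorted(CODES):
    print(ch, CODES[ch])
```

Frequent characters such as `e` and `t` sit near the root and get 2-bit codewords, while the rare `l` and `o` sit deepest and get 4-bit codewords.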
18 Example 1: Solution (cont'd)
- The sequences of zeros and ones on the arcs of the path from the root to each terminal node are the desired codes:

Character   a     e    l      n     o      s     t
Codeword    111   10   0010   110   0011   000   01

- If we assume the message consists of only the characters a, e, l, n, o, s and t, then the number of bits transmitted will be:
  - 2×65 + 2×53 + 3×45 + 3×45 + 3×22 + 4×18 + 4×13 = 696 bits
- If the message is sent uncompressed with 8-bit ASCII representation for the characters, we have:
  - 261 × 8 = 2088 bits, i.e. we saved about 67% (two thirds) of the transmission time.
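The bit counts can be checked by weighting each codeword length by the frequency of its character. The dictionary and variable names below are illustrative; the frequencies and codewords are those of Example 1.

```python
FREQS = {"a": 45, "e": 65, "l": 13, "n": 45, "o": 18, "s": 22, "t": 53}
CODES = {"a": "111", "e": "10", "l": "0010", "n": "110",
         "o": "0011", "s": "000", "t": "01"}

# Compressed size: (codeword length) x (frequency), summed over all characters.
compressed_bits = sum(FREQS[ch] * len(CODES[ch]) for ch in FREQS)

# Uncompressed size: 8 bits per character.
uncompressed_bits = 8 * sum(FREQS.values())

saving_percent = 100 * (uncompressed_bits - compressed_bits) / uncompressed_bits
print(compressed_bits, uncompressed_bits, round(saving_percent, 1))
# 696 2088 66.7
```

The compressed message is exactly one third of the uncompressed size, which corresponds to an average of 696 / 261 ≈ 2.67 bits per character instead of 8.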
19 Example 1: Solution, the Prefix Property
- Data encoded using Huffman coding is uniquely decodable. This is because Huffman codes satisfy an important property called the prefix property.
- This property guarantees that no codeword is a prefix of another Huffman codeword.
- For example, 10 and 101 cannot simultaneously be valid Huffman codewords because the first is a prefix of the second.
- Thus, any bitstream produced with a given Huffman code can be decoded in only one way.
- We can see by inspection that the codewords we generated (shown in the preceding slide) satisfy the prefix property, so they are valid Huffman codewords.
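Both the inspection and the decoding it justifies can be sketched in code. The function names `is_prefix_free`, `encode` and `decode`, and the sample message `"lean"`, are assumptions for illustration; the codewords are those of Example 1.

```python
CODES = {"a": "111", "e": "10", "l": "0010", "n": "110",
         "o": "0011", "s": "000", "t": "01"}

def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    words = list(codewords)
    return not any(a != b and b.startswith(a) for a in words for b in words)

def encode(message, codes):
    """Concatenate the codewords of the message's characters."""
    return "".join(codes[ch] for ch in message)

def decode(bits, codes):
    """Decode left to right; raise ValueError if the bits end mid-codeword."""
    inverse = {word: ch for ch, word in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:      # safe: no codeword extends another one
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bitstream ends inside a codeword: " + current)
    return "".join(decoded)

print(is_prefix_free(CODES.values()))     # True
print(is_prefix_free(["10", "101"]))      # False
bits = encode("lean", CODES)
print(bits, decode(bits, CODES))          # 001010111110 lean
```

Because of the prefix property, the decoder can emit a character as soon as the bits read so far match a codeword, without looking ahead.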
20 Exercises
- Using the Huffman tree constructed in this session, decode the following sequence of bits, if possible. Otherwise, where does the decoding fail?
  - 10100010111010001000010011
- Using the Huffman tree constructed in this session, write the bit sequences that encode the messages:
  - test, state, telnet, notes
- Mention one disadvantage of a lossless compression scheme and one disadvantage of a lossy compression scheme.
- Write a Java program that implements the Huffman coding algorithm.