Title: Greedy Algorithms (Huffman Coding)
1 Greedy Algorithms (Huffman Coding)
2 Huffman Coding
- A technique to compress data effectively
- Usually saves between 20% and 90% of the space, depending on the data
- Lossless compression
- No information is lost
- When you decompress, you get the original file back
3 Huffman Coding Applications
[Figure: Original file → Huffman coding → Compressed file]
- Saving space
- Store compressed files instead of original files
- Transmitting files or data
- Send compressed data to save transmission time and power
- Encryption and decryption
- Cannot read the compressed file without knowing the key
4 Main Idea: Frequency-Based Encoding
- Assume in this file only 6 characters appear
- E, A, C, T, K, N
- The frequencies are
Character Frequency
E 10,000
A 4,000
C 300
T 200
K 100
N 100
- Option 1 (No compression)
- Each character = 1 byte (8 bits)
- Total file size = 14,700 x 8 = 117,600 bits
- Option 2 (Fixed-size compression)
- We have 6 characters, so we need 3 bits to encode them
- Total file size = 14,700 x 3 = 44,100 bits
Character Fixed Encoding
E 000
A 001
C 010
T 100
K 110
N 111
5 Main Idea: Frequency-Based Encoding (Contd)
- Option 3 (Huffman compression)
- Variable-length compression
- Assign shorter codes to more frequent characters and longer codes to less frequent characters
Character Huffman Encoding
E 0
A 10
C 110
T 1110
K 11110
N 11111
- Total file size = (10,000 x 1) + (4,000 x 2) + (300 x 3) + (200 x 4) + (100 x 5) + (100 x 5) = 20,700 bits
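The three totals can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the frequencies and code words are copied from the tables above.

```python
# Frequencies and Huffman code words copied from the tables above.
freq = {"E": 10_000, "A": 4_000, "C": 300, "T": 200, "K": 100, "N": 100}
huffman = {"E": "0", "A": "10", "C": "110", "T": "1110", "K": "11110", "N": "11111"}

total_chars = sum(freq.values())  # 14,700 characters in the file

size_ascii = total_chars * 8      # Option 1: 8 bits per character
size_fixed = total_chars * 3      # Option 2: 3 bits per character
# Option 3: each character costs as many bits as its code word is long.
size_huffman = sum(freq[c] * len(huffman[c]) for c in freq)

print(size_ascii, size_fixed, size_huffman)  # 117600 44100 20700
```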
6 Huffman Coding
- A variable-length coding for characters
- More frequent characters → shorter codes
- Less frequent characters → longer codes
- It is not like ASCII coding, where all characters have the same code length (8 bits)
- Two main questions
- How to assign codes (encoding process)?
- How to decode, i.e. generate the original file from the compressed file (decoding process)?
7 Decoding for fixed-length codes is much easier
010001100110111000
Character Fixed-length Encoding
E 000
A 001
C 010
T 100
K 110
N 111
Divide into 3s
010 001 100 110 111 000
Decode
C A T K N E
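The "divide into 3s, then look up" procedure can be sketched as follows. The function name decode_fixed and the constant FIXED_CODES are illustrative; the table itself is the one shown on this slide.

```python
# Fixed-length code table copied from the slide (3 bits per character).
FIXED_CODES = {"E": "000", "A": "001", "C": "010", "T": "100", "K": "110", "N": "111"}

def decode_fixed(bits, codes, width=3):
    """Split the bit string into width-sized chunks and look each one up."""
    by_code = {code: char for char, code in codes.items()}
    chunks = [bits[i:i + width] for i in range(0, len(bits), width)]
    return "".join(by_code[chunk] for chunk in chunks)

print(decode_fixed("010001100110111000", FIXED_CODES))  # CATKNE
```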
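8 Decoding for variable-length codes is not that easy
000001
Character Variable-length Encoding
E 0
A 00
C 001
With this code, 000001 can be decoded in more than one way (for example E E E C, A E C, or E A C).
Huffman encoding is guaranteed to avoid this ambiguity: there is always a single decoding.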
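To see the ambiguity concretely, the sketch below lists every way the bit string on this slide can be split into code words. The helper all_decodings is a hypothetical name introduced for illustration; it simply tries each code word as the next prefix.

```python
# Variable-length code from the slide; "0" is a prefix of "00" and of "001".
AMBIGUOUS_CODES = {"E": "0", "A": "00", "C": "001"}

def all_decodings(bits, codes):
    """Return every way to split bits into a sequence of code words."""
    if not bits:
        return [""]  # exactly one way to decode the empty string
    results = []
    for char, code in codes.items():
        if bits.startswith(code):
            rest = all_decodings(bits[len(code):], codes)
            results += [char + tail for tail in rest]
    return results

print(sorted(all_decodings("000001", AMBIGUOUS_CODES)))  # ['AEC', 'EAC', 'EEEC']
```

Three different files produce the same bits, so the decoder cannot know which one was intended.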
9 Huffman Algorithm
- Step 1: Get Frequencies
- Scan the file to be compressed and count the occurrences of each character
- Sort the characters based on their frequency
- Step 2: Build Tree and Assign Codes
- Build a Huffman-code tree (binary tree)
- Traverse the tree to assign codes
- Step 3: Encode (Compress)
- Scan the file again and replace each character by its code
- Step 4: Decode (Decompress)
- The Huffman tree is the key to decompressing the file
10 Step 1: Get Frequencies
Input File
Eerie eyes seen near lake.
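Counting the characters of this input file might look like the sketch below. It assumes that spaces and punctuation count as characters and that 'E' and 'e' are distinct, which is how the example treats them.

```python
from collections import Counter

text = "Eerie eyes seen near lake."

# Count every character, including spaces and the final period.
freq = Counter(text)

# List the characters sorted by frequency (ties broken by character).
for char, n in sorted(freq.items(), key=lambda item: (item[1], item[0])):
    print(repr(char), n)
```

The most frequent character is 'e' (8 occurrences), followed by the space (4).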
11 Step 2: Build Huffman Tree and Assign Codes
- It is a binary tree in which each character is a leaf node
- Initially each node is a separate root
- At each step
- Select the two roots with the smallest frequencies and connect them to a new parent (break ties arbitrarily). This is the greedy choice
- The parent gets the sum of the frequencies of its two child nodes
- Repeat until you have one root
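The merging loop above can be sketched with a min-heap. This is one possible implementation, not necessarily the one behind the slides: the tree representation (a leaf is a character, an internal node is a (left, right) pair) and the name build_huffman_tree are assumptions. It also assumes non-empty input, and because ties are broken arbitrarily, the tree shape may differ from the one drawn on the following slides.

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_tree(text):
    """Return (total_frequency, tree) for a non-empty text."""
    tiebreak = count()  # unique sequence numbers, so the heap never compares trees
    heap = [(f, next(tiebreak), char) for char, f in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Greedy choice: remove the two roots with the smallest frequencies.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Their new parent carries the sum of the two frequencies.
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    total, _, tree = heap[0]
    return total, tree

total, tree = build_huffman_tree("Eerie eyes seen near lake.")
print(total)  # 26
```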
12 Example
Each character has a leaf node with its frequency
13-23 Find the smallest two frequencies and replace them with their parent (one merge per slide, 11 merges in total)
24 Now we have a single root. This is the Huffman tree
25 Let's Analyze the Huffman Tree
- All characters are at the leaf nodes
- The number at the root = number of characters in the file
- High-frequency characters (e.g., e) are near the root
- Low-frequency characters are far from the root
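These properties can be checked on the example text. The sketch below runs the same greedy merges but records only each character's depth in the tree, which is also its code length. The function name code_lengths is illustrative, and tie-breaking may differ from the slides, though the two properties checked here hold for any tie-breaking.

```python
import heapq
from collections import Counter

def code_lengths(text):
    """Run the Huffman merges, tracking only each character's depth in the tree."""
    # Heap entries: (frequency, unique id, {char: depth below this root}).
    heap = [(f, i, {c: 0}) for i, (c, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Every leaf under the new parent moves one level further from the root.
        merged = {c: d + 1 for c, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    root_freq, _, depths = heap[0]
    return root_freq, depths

text = "Eerie eyes seen near lake."
root_freq, depths = code_lengths(text)
print(root_freq == len(text))               # True: the root holds the character count
print(depths["e"] == min(depths.values()))  # True: the most frequent character is nearest the root
```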
26 Let's Assign Codes
- Traverse the tree
- Any left edge → add label 0
- Any right edge → add label 1
- The code for each character is its root-to-leaf label sequence
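The traversal can be sketched recursively. To keep the example small, it uses a hand-built tree for the earlier six-character example (E, A, C, T, K, N) rather than the tree for the input file. The nested-tuple representation and the name assign_codes are assumptions made for illustration.

```python
def assign_codes(tree, prefix=""):
    """Walk the tree; a leaf is a character, an internal node is a (left, right) pair."""
    if isinstance(tree, str):
        return {tree: prefix or "0"}  # a one-leaf tree still needs a 1-bit code
    left, right = tree
    codes = assign_codes(left, prefix + "0")         # left edge adds label 0
    codes.update(assign_codes(right, prefix + "1"))  # right edge adds label 1
    return codes

# Hand-built tree for the six-character example.
tree = ("E", ("A", ("C", ("T", ("K", "N")))))
print(assign_codes(tree))
# {'E': '0', 'A': '10', 'C': '110', 'T': '1110', 'K': '11110', 'N': '11111'}
```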
27-28 Let's Assign Codes
[Figure: the Huffman tree with each left edge labeled 0 and each right edge labeled 1]
30 Step 3: Encode (Compress) the File
Input File
Eerie eyes seen near lake.
Encoded File
0000 10 1100 0001 10 ...
Notice that no code is a prefix of any other code → this ensures that the decoding is unique (unlike Slide 8)
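Encoding and the prefix check can be sketched as below. Only the four code words visible on this slide are used (for E, e, r and i); the rest of the table is not reproduced here, so the example encodes just the first word of the input. The names encode and is_prefix_free are illustrative.

```python
from itertools import permutations

def is_prefix_free(codes):
    """True if no code word is a prefix of another character's code word."""
    words = list(codes.values())
    return not any(b.startswith(a) for a, b in permutations(words, 2))

def encode(text, codes):
    """Replace each character by its code word."""
    return "".join(codes[char] for char in text)

# Only the code words shown on this slide; the full table is not listed here.
partial_codes = {"E": "0000", "e": "10", "r": "1100", "i": "0001"}

print(encode("Eerie", partial_codes))                     # 0000101100000110
print(is_prefix_free(partial_codes))                      # True
print(is_prefix_free({"E": "0", "A": "00", "C": "001"}))  # False (the code from Slide 8)
```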
31 Step 4: Decode (Decompress)
- Must have the encoded file and the coding tree
- Scan the encoded file
- For each 0 → move left in the tree
- For each 1 → move right
- On reaching a leaf node → emit that character and go back to the root
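The decoding walk can be sketched as follows, again on a hand-built tree for the six-character example (E=0, A=10, C=110, T=1110, K=11110, N=11111). The nested-tuple tree and the name decode are assumptions, and the sketch assumes the tree has at least two leaves.

```python
def decode(bits, tree):
    """Walk the tree bit by bit; a leaf is a character, an internal node a (left, right) pair."""
    out = []
    node = tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]  # 0 moves left, 1 moves right
        if isinstance(node, str):                  # reached a leaf
            out.append(node)                       # emit its character
            node = tree                            # and go back to the root
    return "".join(out)

# Hand-built tree for the six-character example.
tree = ("E", ("A", ("C", ("T", ("K", "N")))))
print(decode("11101011110011111", tree))  # TAKEN
```

No separators are needed between code words: because the code is prefix-free, reaching a leaf always marks the end of exactly one character.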