Title: Implementation of Huffman Decoder
1. Implementation of Huffman Decoder
- Originally presented by Sheng, Xiaohong
2. Outline
- Introduction
- Huffman Code
- Tree-Based Architecture
- PLA-Based Architecture
- Comparison of Tree-Based and PLA-Based Architectures
- Concurrent Decoding
- Summary
3. Introduction
- VLC (variable-length coding) is an optimal code
- Approximates the source entropy
- High throughput requirement for VLC codec
- Need more than 200 Mb/s for decoder speed
- Difficult to implement high throughput decoder
- Obscured codeword boundary
- High-throughput VLC decoder architectures developed in the paper
- Tree-based architecture
- PLA-based architecture
4. Huffman Code
- Probability distribution based code
(Figure: Huffman tree built from the source statistics)
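As a concrete illustration of building such a code, here is a minimal Python sketch; the four-symbol alphabet and its probabilities are invented for the example, not taken from the slides:

```python
import heapq
import itertools

def build_huffman_code(probabilities):
    """Build a Huffman code from a {symbol: probability} mapping.

    Returns a dict mapping each symbol to its bit string. A counter
    breaks ties so heap comparisons never reach the dict payload.
    """
    counter = itertools.count()
    # Each heap entry: (probability, tie-breaker, {symbol: partial code}).
    heap = [(p, next(counter), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)   # two least probable subtrees
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]

code = build_huffman_code({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1})
```

More probable symbols receive shorter codewords; for these probabilities the code lengths come out as 1, 2, 3, and 3 bits, and the resulting code is prefix-free.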
5. Huffman Code (Continued)
- Advantage
- Efficiency
- Instantaneous
- Codewords can be decoded as soon as they are received
- Exhaustive and uniquely decodable
- Disadvantage
- Variable-Length
- Codeword boundaries are sequentially dependent
- Error Propagation
6. Huffman Encoding and Decoding
- Encoder
- Lookup table
- Fixed length inputs
- Can be encoded independently, and so in parallel, to increase throughput
- Decoder
- FSM model
- Each state represents a node of the tree
- Output
- Decoded symbols
- A single-bit flag
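The FSM view above can be sketched in software: each state is a tree node, each input bit drives one transition, and a leaf emits its symbol with the flag raised and returns to the root. This is a toy Python model; the four-symbol code table is invented for the example:

```python
def make_fsm(code):
    """Build an FSM from a {symbol: bit-string} code table.

    States are tree nodes numbered from 0 (the root).
    transition[state][bit] gives (next_state, symbol_or_None); a decoded
    symbol means the single-bit flag is raised and the FSM returns to
    the root.
    """
    transition = [{}]                      # state 0 is the root
    for symbol, bits in code.items():
        state = 0
        for i, bit in enumerate(bits):
            if i == len(bits) - 1:         # leaf: emit symbol, go to root
                transition[state][bit] = (0, symbol)
            elif bit in transition[state]:
                state = transition[state][bit][0]
            else:
                transition.append({})      # allocate a new internal node
                transition[state][bit] = (len(transition) - 1, None)
                state = len(transition) - 1
    return transition

def decode(transition, bitstream):
    """Decode one bit per 'cycle'; collect (flag, symbol) per cycle."""
    state, out = 0, []
    for bit in bitstream:
        state, symbol = transition[state][bit]
        out.append((1, symbol) if symbol else (0, None))
    return [s for flag, s in out if flag]

code = {"a": "0", "b": "10", "c": "110", "d": "111"}
fsm = make_fsm(code)
```

Feeding the encoded stream for "abacd" ("0" + "10" + "0" + "110" + "111") through `decode` recovers the symbols one bit per cycle, with output appearing only on leaf cycles.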
7. Conventional VLC Decoder
Codebook size: 256, source symbol length: 8 bits, throughput: 1 b/cycle
8. How to Construct the Codebook
- Prerequisite
- Knowledge of the source symbol probabilities
- e.g., the statistics of DPCM differ from those of PCM
- Choose the optimal architecture according to the source statistics
Constructing a Huffman code based on the source statistics of the output signal of an image compression system
9. Directly Mapped Tree-Based Architecture
- 1 b/cycle, the same as the conventional decoder speed
- Clock rate is much faster than the conventional decoder's
10. Pipelined Tree Architecture (1)
- Partition the decoder into pipeline stages
- Each stage implements one level of the Huffman tree
- Condition
- Independent bit streams
- Total number of input streams = Lmax (the depth of the Huffman tree)
11. Pipelined Tree Architecture (2)
Use the pipelined tree-based architecture to
decode multiple independent streams of data
concurrently
12. Pipelined Tree Architecture (3)
An architecture for a high-speed variable-length
rotation shifter
13. Pipelined Tree Architecture (4)
- Single ROM lookup table
- Implements all the branching and storage at each level
- Incoming bit and control message (from the previous stage)
- Together constitute a complete address to read out the codewords terminated at this level
- Produces the control message required by the next stage
- Implemented by cascading Lmax ROMs
14. Performance Analysis of Pipelined Tree-Based Architectures (1)
- Output throughput
- one codeword per cycle
- Input bits of each bit stream are decoded through different decoding paths
- Multiplexed output codewords are interleaved and directed to individual buffers for storing the decoded results of each bit stream
- Critical path
- One propagation delay of rotation shifter
- One ROM access delay
- One Pipeline latch delay
15. Performance Analysis of Tree-Based Architectures (2)
- Advantage
- Smaller ROM → shorter clock cycle time
- Constant output rate
- Disadvantage
- Not flexible
- need independent bit streams
- Not very efficient if the bit stream is finite
- Blocks of the same length contain different numbers of codewords
- Lmin < average throughput < Lmax b/cycle
16. PLA-Based Architectures: Overview
- Decoder model: FSM
- FSM implementation
- ROM
- PLA
- Lower complexity
- high speed
- PLA Implementation
- constant input rate
- constant output rate
- variable I/O rate
17. Constant-Input-Rate PLA-Based Architecture
- Input rate: K bits/cycle
- The PLA performs the table-lookup process
- Input bits
- Determine one unique path along the Huffman tree
- Next state
- Fed back to the PLA input to indicate the final residing state
- Indicator
- The number of symbols decoded in the cycle
The constant-input rate PLA-based architecture
for the VLC decoder
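A software analogue of the PLA's lookup table can be precomputed by walking the Huffman tree for every combination of current state and K-bit input. This is a sketch only: states are named here by the bit prefix decoded so far rather than by hardware state codes, and the code table is invented for the example:

```python
from itertools import product

def pla_table(code, K):
    """Emulate the constant-input-rate PLA: for each (state, K-bit input)
    the table stores (next_state, decoded symbols). States are the
    internal nodes of the Huffman tree, identified by their bit prefix.
    Assumes a complete (Kraft-tight) prefix code."""
    prefixes = {b[:i] for b in code.values() for i in range(len(b))}
    inv = {bits: sym for sym, bits in code.items()}
    table = {}
    for state, pattern in product(prefixes,
                                  map("".join, product("01", repeat=K))):
        cur, symbols = state, []
        for bit in pattern:
            cur += bit
            if cur in inv:                 # reached a leaf: emit, restart
                symbols.append(inv[cur])
                cur = ""
        table[state, pattern] = (cur, tuple(symbols))
    return table

code = {"a": "0", "b": "10", "c": "110", "d": "111"}
table = pla_table(code, K=4)
```

Each "cycle" then costs one lookup, `state, symbols = table[state, next_k_bits]`, mirroring the next-state feedback and the indicator giving the number of symbols decoded in that cycle.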
18. Constant-Output-Rate PLA-Based Architecture (1)
The constant-output-rate PLA-based architecture for the VLC decoder
19. Constant-Output-Rate PLA-Based Architecture (2)
- Output rate: one codeword/cycle
- Barrel shifter stores a 32-b window of input data
- The PLA takes 16 b at its input to decode one codeword
- The PLA output includes the actual length of the decoded codeword
- This length is accumulated and used to track the new starting position within the 32-b window, from which the next 16 b are taken for decoding
- The carry bit of the accumulator issues a data-fetch command to get the next word from the input buffer
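The window/accumulator mechanics can be mimicked in a few lines of Python. This is a behavioural sketch only: the linear prefix search stands in for the PLA's single-cycle match, and the code table and word values are invented for the example:

```python
def constant_output_decode(code, words, n_symbols):
    """Software sketch of the constant-output-rate scheme: a 32-bit
    window (modeled as a bit string) holds two 16-bit words; each
    'cycle' the decoder looks at 16 bits, emits exactly one codeword,
    and advances the position accumulator by its length. When the
    accumulator wraps past 16 (the carry), the next 16-bit word is
    fetched from the input buffer."""
    inv = {bits: sym for sym, bits in code.items()}
    window = words.pop(0) + words.pop(0)   # fill the 32-bit window
    pos, out = 0, []
    for _ in range(n_symbols):
        chunk = window[pos:pos + 16]       # the PLA's 16-bit input
        for L in range(1, 17):             # find the unique prefix match
            if chunk[:L] in inv:
                out.append(inv[chunk[:L]])
                pos += L                   # accumulate codeword length
                break
        if pos >= 16:                      # accumulator carry: fetch word
            window = window[16:] + (words.pop(0) if words else "0" * 16)
            pos -= 16
    return out

code = {"a": "0", "b": "10", "c": "110", "d": "111"}
# "abacd" encoded twice ("0100110111" * 2), padded out to two 16-b words
words = ["0100110111010011", "0111000000000000"]
symbols = constant_output_decode(code, words, n_symbols=10)
```

Note that exactly one codeword is produced per iteration regardless of its length, which is the defining property of this architecture.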
20. Variable-I/O-Rate PLA-Based Architecture
The variable-I/O-rate PLA-based architecture for the VLC decoder
21. High-Level PLA Optimization Techniques: Overview
- Logical minimization → lower PLA complexity
- Performed by VLSI CAD systems (ESPRESSO)
- PLA decomposition → shorter clock cycle time → higher throughput
A product term stands for a one-to-one input-output correspondence. Complexity is determined by the input width, the output width, and the number of product terms. Higher PLA complexity means a longer clock cycle time and a larger silicon area.
22. High-Level PLA Optimization Techniques: Complexity Table
Comparisons of complexity and throughput of
PLA-Based architectures
23. PLA Decomposition (1)
- PLA decomposition
- Split a single PLA truth table into several subsets
- Remove don't-care input bits or zero-valued output bits
- Operate all decomposed PLAs in parallel
- Add OR gates to preserve overall functional correctness
- Reduces the PLA cycle time
- Increases the OR-gate delay
24. PLA Decomposition (2)
PLA specification suitable for decomposition
Using OR gates to combine the outputs from
different decomposed PLA subblocks
25. Pipelined PLA-Based Architecture
- Separate the PLA outputs into several stages
The pipelined constant-input rate PLA-based
architecture for VLC decoder. The first stage
decodes the next state output only. The second
stage decodes the symbols contained in the
current block of input
26. Performance Analysis of PLA-Based Architectures
- Net decoder information rate = (number of bits/cycle) × (clock rate)
- More bits or codewords decoded per cycle → higher PLA complexity → lower clock rate
- Chip size ∝ circuit complexity ∝ (2I + O) × P
- I: number of input terms
- O: number of output terms
- P: number of product terms
27. PLA-Based Architecture Implementation and Simulation Results
- Original complexity
- 17 inputs, 43 outputs, 65536 product terms
- After logical optimization
- Product terms reduced to 4665
- After the pipelining technique
- The first-stage PLA has 17 inputs, 8 outputs, and 1582 product terms, and can be decomposed into 64 small PLAs
- In the second stage, the whole PLA can be decomposed, and the cascaded OR gates can be pipelined into as many levels as necessary
28. Comparison of Tree-Based and PLA-Based Architectures
- Tree based architectures
- Regular structure
- Short clock cycle
- Partially programmable
- Bit streams need to be independent
- PLA-based architectures
- Not programmable (PLA optimization techniques are content dependent)
- More flexible
- Input/output rates can be fixed or variable to meet real application requirements
29. Why Is Concurrent Decoding Needed?
- Tree-based and PLA-based architectures are
- Application dependent
- Architecture dependent
- Sometimes technology dependent
- Helpful for specific architectures, but limited in general
- So, we need
- A general method to improve the decoding speed of an arbitrary decoder architecture through concurrency
30. Concurrent Decoding for Multiple Sources
- Individual color components
- Subsampling or filtering to decompose an image
- Concurrency level is the number of coded bit streams or the number of components
- Disadvantage
- The amount of concurrency is application dependent and insufficient
- Individually coded components may have different channel rates, different peak coding rates, and different source symbol statistics, and thus different codebooks, so they can hardly share the same concurrent decoding hardware
31. Concurrent Decoding for a Single Source: Basic Concurrent Techniques (1)
- FSM
- If the input rate is constant, the decoder is a synchronous FSM that doesn't always generate output
32. Basic Concurrent Techniques (2)
- To run an FSM K times faster, the table lookup takes K inputs in parallel per cycle
- Table size is N·A^K (originally N·A, for N states and alphabet size A)
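In table terms, the speed-up is just composing the transition function K times, which is why the table grows from N·A to N·A^K entries. A small Python sketch, using an invented three-state FSM (remainder of a binary number mod 3) as the example:

```python
from itertools import product

def speed_up(delta, states, alphabet, K):
    """Compose a 1-input transition table delta[(s, a)] -> s' into a
    K-input table of size N * A**K, so one lookup consumes K inputs."""
    deltaK = {}
    for s in states:
        for word in product(alphabet, repeat=K):
            cur = s
            for a in word:                 # walk K ordinary transitions
                cur = delta[(cur, a)]
            deltaK[(s, word)] = cur
    return deltaK

# toy FSM: remainder of the binary input modulo 3 (N = 3 states, A = 2)
delta = {(s, b): (2 * s + b) % 3 for s in range(3) for b in (0, 1)}
delta4 = speed_up(delta, range(3), (0, 1), K=4)
```

One lookup in `delta4` now advances the machine by four input bits, at the cost of a table 2^4 times larger per state.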
33. Concurrent Decoding with the Bit-Position Method (1)
- Based on block postcomputation principle
- Break the coded bit stream into segments
- Decode concurrently for all segments
- Process the dependency among codeword boundaries
- Concurrency level is theoretically unlimited
34. Concurrent Decoding with the Bit-Position Method (2)
- Example
- Incoming bits are divided into blocks of M bits
- The overlapping window has Lmax bits, where Lmax is the maximum codeword size
- The codeword boundary lies within Lmax bits
35. Concurrent Decoding with Bit-Controlled Coding
- The encoder finds the codeword for the current input symbol, buffers the codeword bits, and sends the bits out sequentially
- Select a block length of M bits
- A counter tracks the number of bits generated by the encoder and resets every M bits to signal the block boundaries
- If the codeword is complete at a boundary, the codeword boundary is aligned with the block boundary
- If not, the encoder repeats the buffered bits of the incomplete codeword in the next block and then continues to complete the codeword
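A toy Python encoder for this scheme, plus a block-wise decoder showing that each M-bit block can then be decoded independently; the code table and M are invented for the example, and M is assumed to be at least the maximum codeword length:

```python
def bit_controlled_encode(code, symbols, M):
    """Sketch of the bit-controlled encoder: a counter resets every M
    bits to mark block boundaries; if a codeword would straddle a
    boundary, its already-sent bits are repeated at the start of the
    next block, so every block begins on a codeword boundary."""
    out, count = [], 0
    for sym in symbols:
        bits = code[sym]
        if count + len(bits) > M:          # codeword would straddle
            sent = bits[:M - count]        # bits emitted before boundary
            out.append(sent)
            out.append(sent + bits[M - count:])  # repeat, then complete
            count = len(bits)
        else:
            out.append(bits)
            count += len(bits)
        if count == M:                     # counter reset at the boundary
            count = 0
    return "".join(out)

def decode_blocks(code, stream, M):
    """Each M-bit block decodes independently from its own boundary;
    a trailing partial codeword is discarded (its bits reappear at the
    start of the next block)."""
    inv = {b: s for s, b in code.items()}
    out = []
    for i in range(0, len(stream), M):
        cur = ""
        for bit in stream[i:i + M]:
            cur += bit
            if cur in inv:
                out.append(inv[cur])
                cur = ""
    return out

code = {"a": "0", "b": "10", "c": "110", "d": "111"}
encoded = bit_controlled_encode(code, "abacd", M=4)
```

The repeated bits are the price paid for making blocks self-contained: the decoder never needs to resolve a boundary across blocks, so the blocks can be handed to parallel decoders directly.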
36. Summary
- Two classes of high-speed VLC decoder
- Tree-based architectures
- PLA-based architectures
- Achieve
- Single-chip implementation
- Over 200 Mb/s decoding speed
- Concurrency decoding methods
- Multiple source concurrency decoding
- Single source concurrency decoding
- Basic Concurrent Technique
- Bit-Positioning Method
- Controlled coding Method
37. Useful Links
- http://castle.ee.nctu.edu.tw/mountain/paper.html
- http://www-us6.semiconductors.com/pip/TDA8043H
- http://www.dolby.com/tech/ac-3mult.html
- http://www.infotech.tu-chemnitz.de/microtec/eng/press/annualrep98/mpeg.htm
- http://www.ece.wisc.edu/hu/ece734/references/chang92a.pdf
- http://www.ece.wisc.edu/hu/ece734/references/chang92b.pdf
- http://ieeexplore.ieee.org/iel5/6126/16375/00755661.pdf