Title: Implementation of Huffman Decoder
1. Implementation of Huffman Decoder
- Originally presented by Sheng, Xiaohong
2. Outline
- Introduction
- Huffman Code
- Tree-Based Architecture
- PLA-Based Architecture
- Comparison of Tree-Based and PLA-Based Architectures
- Concurrent Decoding
- Summary
3. Introduction
- VLC (variable-length coding) is an optimal code
- Approximates the source entropy
- High throughput requirement for VLC codec
- Need more than 200 Mb/s for decoder speed
- Difficult to implement high throughput decoder
- Obscured codeword boundary
- High-throughput VLC decoder architectures developed in the paper
- Tree-based architecture
- PLA-based architecture
4. Huffman Code
- Probability distribution based code
(Figure: Huffman tree built from the source statistics)
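As a concrete illustration of building such a code, here is a minimal Python sketch; the four-symbol alphabet and its probabilities are invented for the example, not taken from the slides:

```python
import heapq
import itertools

def build_huffman_code(probabilities):
    """Build a Huffman code from a {symbol: probability} mapping.

    Returns a dict mapping each symbol to its bit string. A counter
    breaks ties so heap comparisons never reach the dict payload.
    """
    counter = itertools.count()
    # Each heap entry: (probability, tie-breaker, {symbol: partial code}).
    heap = [(p, next(counter), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)   # two least probable subtrees
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]

code = build_huffman_code({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1})
```

More probable symbols receive shorter codewords; for these probabilities the code lengths come out as 1, 2, 3, and 3 bits, and the resulting code is prefix-free.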
5. Huffman Code (Continued)
- Advantage
- Efficiency
- Instantaneous
- Codewords can be decoded as soon as they are received
- Exhaustive and uniquely decodable
- Disadvantage
- Variable-Length
- Codeword boundaries are sequentially dependent
- Error Propagation
6. Huffman Encoding and Decoding
- Encoder
- Lookup table
- Fixed length inputs
- Can be encoded independently, and so in parallel, to increase throughput
- Decoder
- FSM model
- Each state represents a node of the tree
- Output
- Decoded symbols
- A single-bit flag
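The FSM view above can be sketched in software: each state is a tree node, each input bit drives one transition, and a leaf emits its symbol with the flag raised and returns to the root. This is a toy Python model; the four-symbol code table is invented for the example:

```python
def make_fsm(code):
    """Build an FSM from a {symbol: bit-string} code table.

    States are tree nodes numbered from 0 (the root).
    transition[state][bit] gives (next_state, symbol_or_None); a decoded
    symbol means the single-bit flag is raised and the FSM returns to
    the root.
    """
    transition = [{}]                      # state 0 is the root
    for symbol, bits in code.items():
        state = 0
        for i, bit in enumerate(bits):
            if i == len(bits) - 1:         # leaf: emit symbol, go to root
                transition[state][bit] = (0, symbol)
            elif bit in transition[state]:
                state = transition[state][bit][0]
            else:
                transition.append({})      # allocate a new internal node
                transition[state][bit] = (len(transition) - 1, None)
                state = len(transition) - 1
    return transition

def decode(transition, bitstream):
    """Decode one bit per 'cycle'; collect (flag, symbol) per cycle."""
    state, out = 0, []
    for bit in bitstream:
        state, symbol = transition[state][bit]
        out.append((1, symbol) if symbol else (0, None))
    return [s for flag, s in out if flag]

code = {"a": "0", "b": "10", "c": "110", "d": "111"}
fsm = make_fsm(code)
```

Feeding the encoded stream for "abacd" ("0" + "10" + "0" + "110" + "111") through `decode` recovers the symbols one bit per cycle, with output appearing only on leaf cycles.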
7. Conventional VLC Decoder
Codebook size: 256, source symbol length: 8 bits, throughput: 1 b/cycle
8. How to Construct the Codebook
- Prerequisite
- Knowledge of the source symbol probabilities
- e.g., the statistics of DPCM differ from those of PCM
- Choose the optimal architecture according to the source statistics
Constructing a Huffman code based on the source statistics of the output signal of an image compression system
9. Directly Mapped Tree-Based Architecture
- 1 b/cycle, the same as the conventional decoder speed
- Clock rate is much faster than the conventional decoder's
10. Pipelined Tree Architecture (1)
- Partition the decoder into pipeline stages
- Each stage implements one level of the Huffman tree
- Condition
- Independent bit streams
- Total number of input streams = Lmax (the depth of the Huffman tree)
11. Pipelined Tree Architecture (2)
Use the pipelined tree-based architecture to
decode multiple independent streams of data
concurrently
12. Pipelined Tree Architecture (3)
An architecture for a high-speed variable-length
rotation shifter
13. Pipelined Tree Architecture (4)
- Single ROM lookup table
- Implements all the branching and storage at each level
- Incoming bit and control message (from the previous stage)
- Together constitute a complete address to read out the codewords terminated at this level
- Produces the control message required by the next stage
- Implemented by cascading Lmax ROMs
14. Performance Analysis of Pipelined Tree-Based Architectures (1)
- Output throughput
- one codeword per cycle
- Input bits of each bit stream are decoded through different decoding paths
- Multiplexed output codewords are interleaved and directed to individual buffers for storing the decoded results of each bit stream
- Critical path
- One propagation delay of rotation shifter
- One ROM access delay
- One Pipeline latch delay
15. Performance Analysis of Tree-Based Architectures (2)
- Advantage
- Smaller ROM → shorter clock cycle time
- Constant output rate
- Disadvantage
- Not flexible
- need independent bit streams
- Not very efficient if the bit stream is finite
- Blocks of the same length contain different numbers of codewords
- Lmin < average throughput < Lmax b/cycle
16. PLA-Based Architectures: Overview
- Decoder model: FSM
- FSM implementation
- ROM
- PLA
- Lower complexity
- high speed
- PLA Implementation
- constant input rate
- constant output rate
- variable I/O rate
17. Constant-Input-Rate PLA-Based Architecture
- Input rate: K bits/cycle
- The PLA performs the table-lookup process
- Input bits
- Determine one unique path along the Huffman tree
- Next state
- Fed back to the PLA input to indicate the final residing state
- Indicator
- The number of symbols decoded in the cycle
The constant-input rate PLA-based architecture
for the VLC decoder
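A software analogue of the PLA's lookup table can be precomputed by walking the Huffman tree for every combination of current state and K-bit input. This is a sketch only: states are named here by the bit prefix decoded so far rather than by hardware state codes, and the code table is invented for the example:

```python
from itertools import product

def pla_table(code, K):
    """Emulate the constant-input-rate PLA: for each (state, K-bit input)
    the table stores (next_state, decoded symbols). States are the
    internal nodes of the Huffman tree, identified by their bit prefix.
    Assumes a complete (Kraft-tight) prefix code."""
    prefixes = {b[:i] for b in code.values() for i in range(len(b))}
    inv = {bits: sym for sym, bits in code.items()}
    table = {}
    for state, pattern in product(prefixes,
                                  map("".join, product("01", repeat=K))):
        cur, symbols = state, []
        for bit in pattern:
            cur += bit
            if cur in inv:                 # reached a leaf: emit, restart
                symbols.append(inv[cur])
                cur = ""
        table[state, pattern] = (cur, tuple(symbols))
    return table

code = {"a": "0", "b": "10", "c": "110", "d": "111"}
table = pla_table(code, K=4)
```

Each "cycle" then costs one lookup, `state, symbols = table[state, next_k_bits]`, mirroring the next-state feedback and the indicator giving the number of symbols decoded in that cycle.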
18. Constant-Output-Rate PLA-Based Architecture (1)
The constant-output-rate PLA-based architecture for the VLC decoder
19. Constant-Output-Rate PLA-Based Architecture (2)
- Output rate: one codeword/cycle
- Barrel shifter stores a 32-b window of input data
- The PLA takes 16 b at its input to decode one codeword
- The PLA output includes the actual length of the decoded codeword
- This length is accumulated and used to track the new starting position within the 32-b window, from which the next 16 b are taken for decoding
- The carry bit of the accumulator issues a data-fetch command to get the next word from the input buffer
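The window/accumulator mechanics can be mimicked in a few lines of Python. This is a behavioural sketch only: the linear prefix search stands in for the PLA's single-cycle match, and the code table and word values are invented for the example:

```python
def constant_output_decode(code, words, n_symbols):
    """Software sketch of the constant-output-rate scheme: a 32-bit
    window (modeled as a bit string) holds two 16-bit words; each
    'cycle' the decoder looks at 16 bits, emits exactly one codeword,
    and advances the position accumulator by its length. When the
    accumulator wraps past 16 (the carry), the next 16-bit word is
    fetched from the input buffer."""
    inv = {bits: sym for sym, bits in code.items()}
    window = words.pop(0) + words.pop(0)   # fill the 32-bit window
    pos, out = 0, []
    for _ in range(n_symbols):
        chunk = window[pos:pos + 16]       # the PLA's 16-bit input
        for L in range(1, 17):             # find the unique prefix match
            if chunk[:L] in inv:
                out.append(inv[chunk[:L]])
                pos += L                   # accumulate codeword length
                break
        if pos >= 16:                      # accumulator carry: fetch word
            window = window[16:] + (words.pop(0) if words else "0" * 16)
            pos -= 16
    return out

code = {"a": "0", "b": "10", "c": "110", "d": "111"}
# "abacd" encoded twice ("0100110111" * 2), padded out to two 16-b words
words = ["0100110111010011", "0111000000000000"]
symbols = constant_output_decode(code, words, n_symbols=10)
```

Note that exactly one codeword is produced per iteration regardless of its length, which is the defining property of this architecture.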
20. Variable-I/O-Rate PLA-Based Architecture
The variable-I/O-rate PLA-based architecture for the VLC decoder
21. High-Level PLA Optimization Techniques: Overview
- Logical minimization → lower PLA complexity
- Performed by VLSI CAD systems (ESPRESSO)
- PLA decomposition → shorter clock cycle time → higher throughput
A product term stands for a one-to-one input-output correspondence. Complexity is determined by the input width, the output width, and the number of product terms. Higher PLA complexity means a longer clock cycle time and a larger silicon area.
22. High-Level PLA Optimization Techniques: Complexity Table
Comparisons of complexity and throughput of
PLA-Based architectures
23. PLA Decomposition (1)
- PLA decomposition
- Split a single PLA truth table into several subsets
- Remove don't-care input bits or zero-valued output bits
- Operate all decomposed PLAs in parallel
- Add OR gates to preserve overall functional correctness
- Reduces the PLA cycle time
- Increases the OR-gate delay
24. PLA Decomposition (2)
PLA specification suitable for decomposition
Using OR gates to combine the outputs from
different decomposed PLA subblocks
25. Pipelined PLA-Based Architecture
- Separate the PLA outputs into several stages
The pipelined constant-input rate PLA-based
architecture for VLC decoder. The first stage
decodes the next state output only. The second
stage decodes the symbols contained in the
current block of input
26. Performance Analysis of PLA-Based Architectures
- Net decoder information rate = (number of bits/cycle) × (clock rate)
- More bits or codewords decoded per cycle → higher PLA complexity → lower clock rate
- Chip size ∝ circuit complexity ∝ (2I + O) × P
- I: number of input terms
- O: number of output terms
- P: number of product terms
27. PLA-Based Architecture Implementation and Simulation Results
- Original complexity
- 17 inputs, 43 outputs, 65536 product terms
- After logical optimization
- Product terms reduced to 4665
- After the pipelining technique
- The first-stage PLA has 17 inputs, 8 outputs, and 1582 product terms, and can be decomposed into 64 small PLAs
- In the second stage, the whole PLA can be decomposed, and the cascaded OR gates can be pipelined into as many levels as necessary
28. Comparison of Tree-Based and PLA-Based Architectures
- Tree based architectures
- Regular structure
- Short clock cycle
- Partially programmable
- Bit streams need to be independent
- PLA-based architectures
- Not programmable (PLA optimization techniques are content dependent)
- More flexible
- Input/output rates can be fixed or variable to meet real application requirements
29. Why Is Concurrent Decoding Needed?
- Tree-based and PLA-based architectures are
- Application dependent
- Architecture dependent
- Sometimes technology dependent
- Helpful for specific architectures, but limited in general
- So, we need
- A general method to improve the decoding speed of an arbitrary decoder architecture through concurrency
30. Concurrent Decoding for Multiple Sources
- Individual color components
- Subsampling or filtering to decompose an image
- Concurrency level is the number of coded bit streams or the number of components
- Disadvantage
- The amount of concurrency is application dependent and insufficient
- Individually coded components may have different channel rates, different peak coding rates, and different source symbol statistics, and thus different codebooks, so they can hardly share the same concurrent decoding hardware
31. Concurrent Decoding for a Single Source: Basic Concurrent Techniques (1)
- FSM
- If the input rate is constant, the decoder is a synchronous FSM that doesn't always generate output
32. Basic Concurrent Techniques (2)
- To run an FSM K times faster, the table lookup takes K inputs in parallel per cycle
- Table size is N·A^K (originally N·A, for N states and alphabet size A)
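In table terms, the speed-up is just composing the transition function K times, which is why the table grows from N·A to N·A^K entries. A small Python sketch, using an invented three-state FSM (remainder of a binary number mod 3) as the example:

```python
from itertools import product

def speed_up(delta, states, alphabet, K):
    """Compose a 1-input transition table delta[(s, a)] -> s' into a
    K-input table of size N * A**K, so one lookup consumes K inputs."""
    deltaK = {}
    for s in states:
        for word in product(alphabet, repeat=K):
            cur = s
            for a in word:                 # walk K ordinary transitions
                cur = delta[(cur, a)]
            deltaK[(s, word)] = cur
    return deltaK

# toy FSM: remainder of the binary input modulo 3 (N = 3 states, A = 2)
delta = {(s, b): (2 * s + b) % 3 for s in range(3) for b in (0, 1)}
delta4 = speed_up(delta, range(3), (0, 1), K=4)
```

One lookup in `delta4` now advances the machine by four input bits, at the cost of a table 2^4 times larger per state.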
33. Concurrent Decoding with the Bit-Position Method (1)
- Based on block postcomputation principle
- Break the coded bit stream into segments
- Decode concurrently for all segments
- Process the dependency among codeword boundaries
- Concurrency level is theoretically unlimited
34. Concurrent Decoding with the Bit-Position Method (2)
- Example
- Incoming bits are divided into blocks of M bits
- The overlapping window has Lmax bits, where Lmax is the maximum codeword size
- The codeword boundary lies within Lmax bits
35. Concurrent Decoding with Bit-Controlled Coding
- The encoder finds the codeword for the current input symbol, buffers the codeword bits, and sends the bits out sequentially
- Select a block length of M bits
- A counter tracks the number of bits generated by the encoder and resets every M bits to signal the block boundaries
- If the codeword is complete at a boundary, the codeword boundary is aligned with the block boundary
- If not, the encoder repeats the buffered bits of the incomplete codeword in the next block and then continues to complete the codeword
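A toy Python encoder for this scheme, plus a block-wise decoder showing that each M-bit block can then be decoded independently; the code table and M are invented for the example, and M is assumed to be at least the maximum codeword length:

```python
def bit_controlled_encode(code, symbols, M):
    """Sketch of the bit-controlled encoder: a counter resets every M
    bits to mark block boundaries; if a codeword would straddle a
    boundary, its already-sent bits are repeated at the start of the
    next block, so every block begins on a codeword boundary."""
    out, count = [], 0
    for sym in symbols:
        bits = code[sym]
        if count + len(bits) > M:          # codeword would straddle
            sent = bits[:M - count]        # bits emitted before boundary
            out.append(sent)
            out.append(sent + bits[M - count:])  # repeat, then complete
            count = len(bits)
        else:
            out.append(bits)
            count += len(bits)
        if count == M:                     # counter reset at the boundary
            count = 0
    return "".join(out)

def decode_blocks(code, stream, M):
    """Each M-bit block decodes independently from its own boundary;
    a trailing partial codeword is discarded (its bits reappear at the
    start of the next block)."""
    inv = {b: s for s, b in code.items()}
    out = []
    for i in range(0, len(stream), M):
        cur = ""
        for bit in stream[i:i + M]:
            cur += bit
            if cur in inv:
                out.append(inv[cur])
                cur = ""
    return out

code = {"a": "0", "b": "10", "c": "110", "d": "111"}
encoded = bit_controlled_encode(code, "abacd", M=4)
```

The repeated bits are the price paid for making blocks self-contained: the decoder never needs to resolve a boundary across blocks, so the blocks can be handed to parallel decoders directly.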
36. Summary
- Two classes of high-speed VLC decoder
- Tree-based architectures
- PLA-based architectures
- Achieve
- Single-chip implementation
- Over 200 Mb/s decoding speed
- Concurrency decoding methods
- Multiple source concurrency decoding
- Single source concurrency decoding
- Basic Concurrent Technique
- Bit-Positioning Method
- Controlled coding Method
37. Useful Links
- http://castle.ee.nctu.edu.tw/mountain/paper.html
- http://www-us6.semiconductors.com/pip/TDA8043H
- http://www.dolby.com/tech/ac-3mult.html
- http://www.infotech.tu-chemnitz.de/microtec/eng/press/annualrep98/mpeg.htm
- http://www.ece.wisc.edu/hu/ece734/references/chang92a.pdf
- http://www.ece.wisc.edu/hu/ece734/references/chang92b.pdf
- http://ieeexplore.ieee.org/iel5/6126/16375/00755661.pdf