Transcript and Presenter's Notes

Title: Implementation of Huffman Decoder


1
Implementation of Huffman Decoder
  • Originally presented by Sheng, Xiaohong

2
Outline
  • Introduction
  • Huffman Code
  • Tree-Based Architecture
  • PLA-Based Architecture
  • Comparison of Tree-Based and PLA-Based
    Architectures
  • Concurrent Decoding
  • Summary

3
Introduction
  • VLC (variable-length coding) is an optimal code
  • Approximates the source entropy
  • High throughput requirement for VLC codecs
  • Decoder speeds of more than 200 Mb/s are needed
  • Difficult to implement a high-throughput decoder
  • Codeword boundaries are obscured
  • High-throughput VLC decoder architectures developed in the
    paper
  • Tree-based architecture
  • PLA-based architecture

4
Huffman code
  • Probability distribution based code

Huffman tree and source statistics (figure)
5
Huffman Code (Continued)
  • Advantages
  • Efficient
  • Instantaneous
  • Codewords can be decoded as soon as they are received
  • Exhaustive and uniquely decodable
  • Disadvantages
  • Variable length
  • Codeword boundaries are sequentially dependent
  • Error propagation

6
Huffman Encoding and Decoding
  • Encoder
  • Lookup table
  • Fixed-length inputs
  • Symbols can be encoded independently, and therefore in
    parallel, to increase throughput
  • Decoder
  • FSM model (sketched below)
  • Each state represents a node of the tree
  • Output
  • Decoded symbols
  • A single-bit flag signaling a completed codeword
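Below is a minimal software sketch of this FSM decoder model, assuming a small hypothetical codebook rather than the paper's 256-entry one: each state corresponds to a node of the Huffman tree, one bit is consumed per cycle, and reaching a leaf emits the decoded symbol together with the single-bit flag.

```python
# Minimal FSM-style Huffman decoder sketch (hypothetical 4-symbol codebook).
CODEBOOK = {"a": "0", "b": "10", "c": "110", "d": "111"}  # assumed example

def build_tree(codebook):
    """Build the Huffman tree as nested dicts keyed by '0'/'1'; leaves hold symbols."""
    root = {}
    for symbol, code in codebook.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = symbol
    return root

def fsm_decode(bits, root):
    """Consume one bit per cycle; yield (symbol, flag) every cycle."""
    state = root                      # each state is a node of the tree
    for bit in bits:
        nxt = state[bit]
        if isinstance(nxt, str):      # leaf: codeword complete
            yield nxt, 1              # decoded symbol, flag = 1
            state = root
        else:
            yield None, 0             # internal node: no symbol this cycle
            state = nxt

tree = build_tree(CODEBOOK)
print([s for s, flag in fsm_decode("010110111", tree) if flag])  # ['a', 'b', 'c', 'd']
```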

7
Conventional VLC Decoder
Codebook size = 256, source symbol length = 8 bits
1 b/cycle
8
How to Construct the Codebook
  • Prerequisite
  • Knowledge of the source symbol probabilities
  • e.g., the statistics of DPCM differ from those of PCM
  • Choose the optimal architecture according to the source
    statistics

Constructing a Huffman code based on the source
statistics of the output signal of an image
compression system
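As an illustration of this construction step, a minimal Huffman-code builder is sketched below; the source statistics are placeholder values, not the image-compression statistics the figure refers to.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code (symbol -> bit string) from symbol probabilities."""
    order = count()   # tie-breaker so the heap never compares dicts
    heap = [(p, next(order), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)          # two least probable subtrees
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(order), merged))
    return heap[0][2]

# Placeholder source statistics (assumed, not from the paper).
stats = {"s0": 0.5, "s1": 0.25, "s2": 0.15, "s3": 0.10}
print(huffman_code(stats))   # e.g. {'s0': '0', 's1': '10', 's3': '110', 's2': '111'}
```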
9
Directly Mapped Tree-Based Architecture
  • 1 b/cycle, the same as the conventional decoder
  • Clock rate is much faster than that of the conventional
    decoder

10
Pipelined Tree Architecture (1)
  • Partition the decoder into pipeline stages
  • Each stage includes one level of the Huffman tree
  • Condition
  • Independent bit streams
  • Total number of input streams = Lmax (the depth of the
    Huffman tree)

11
Pipelined Tree Architecture (2)
Use the pipelined tree-based architecture to
decode multiple independent streams of data
concurrently
12
Pipelined Tree Architecture (3)
An architecture for a high-speed variable-length
rotation shifter
13
Pipelined Tree Architecture (4)
  • Single ROM lookup table
  • Implements all the branching and storage at each level
  • Incoming bit and control message (from the previous stage)
  • Constitute a complete address to read out any codeword that
    terminates at this level
  • Produce the control message required by the next stage
  • Implemented by cascading Lmax ROMs (one stage is modeled
    below)
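One pipeline stage of this cascaded-ROM lookup can be modeled in software as below; the address layout (control message = the partial codeword decoded so far, plus the incoming bit) is an illustrative assumption, not the paper's exact ROM format.

```python
# Model of one per-level ROM: (control_in, incoming bit) -> (symbol or None, control_out).
CODEBOOK = {"a": "0", "b": "10", "c": "110", "d": "111"}   # assumed example

def build_level_rom(codebook, level):
    """ROM for tree level 'level'; the control message names the partial path so far."""
    reverse = {code: sym for sym, code in codebook.items()}
    rom = {}
    for prefix in {c[:level] for c in codebook.values() if len(c) > level}:
        for bit in "01":
            path = prefix + bit
            symbol = reverse.get(path)           # codeword terminating at this level?
            rom[(prefix, bit)] = (symbol, "" if symbol else path)
    return rom

# Example: the level-1 ROM resolves the second bit of a codeword.
rom1 = build_level_rom(CODEBOOK, 1)
print(rom1[("1", "0")])   # ('b', '')   -- '10' terminates here; control resets
print(rom1[("1", "1")])   # (None, '11') -- path continues into the next level
```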

14
Performance Analysis of Pipelined Tree-Based
Architectures (1)
  • Output throughput
  • One codeword per cycle
  • Input bits of each bit stream are decoded through different
    decoding paths
  • Multiplexed output codewords are interleaved and directed to
    individual buffers that store the decoded results of each
    bit stream
  • Critical path
  • One propagation delay of the rotation shifter
  • One ROM access delay
  • One pipeline latch delay
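Stated as a formula (generic symbols, not the paper's notation), the clock period is bounded by the critical-path delay above, and the decoder delivers one codeword per cycle:

```latex
T_{\mathrm{clk}} \;\ge\; t_{\mathrm{shifter}} + t_{\mathrm{ROM}} + t_{\mathrm{latch}},
\qquad \text{throughput} = 1\ \text{codeword} / T_{\mathrm{clk}}
```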

15
Performance Analysis of Tree-Based Architectures (2)
  • Advantages
  • Smaller ROM → shorter clock cycle time
  • Constant output rate
  • Disadvantages
  • Not flexible
  • Needs independent bit streams
  • Not very efficient if the bit streams are finite
  • Blocks of the same length contain different numbers of
    codewords
  • Lmin < average throughput < Lmax bits/cycle

16
PLA-Based Architectures: Overview
  • Decoder model: FSM
  • FSM implementation
  • ROM
  • PLA
  • Lower complexity
  • Higher speed
  • PLA implementations
  • Constant input rate
  • Constant output rate
  • Variable I/O rate

17
Constant-Input-Rate PLA-Based Architecture
  • Input rate: K bits/cycle
  • The PLA performs the table-lookup process
  • Input bits
  • Determine one unique path along the Huffman tree
  • Next state
  • Fed back to the PLA input to indicate the final residing
    state
  • Indicator
  • The number of symbols decoded in the cycle (see the sketch
    below)

The constant-input rate PLA-based architecture
for the VLC decoder
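A behavioral sketch of one constant-input-rate cycle follows, assuming a hypothetical codebook and K = 4: exactly K bits are consumed per cycle, the pending prefix plays the role of the fed-back state, and the number of symbols decoded in the cycle is the indicator.

```python
CODEBOOK = {"a": "0", "b": "10", "c": "110", "d": "111"}   # assumed example

def decode_k_bits(state, bits, codebook):
    """One cycle of a constant-input-rate decoder.
    state = bits of a partially decoded codeword carried over from the last cycle.
    Returns (symbols decoded this cycle, next state to feed back)."""
    symbols, prefix = [], state
    for bit in bits:
        prefix += bit
        for sym, code in codebook.items():
            if code == prefix:        # a codeword just completed
                symbols.append(sym)
                prefix = ""
                break
    return symbols, prefix

# Drive the decoder at K = 4 bits/cycle over an assumed bit stream.
stream, K, state = "010110111010", 4, ""
for i in range(0, len(stream), K):
    symbols, state = decode_k_bits(state, stream[i:i + K], CODEBOOK)
    print(symbols, len(symbols))   # decoded symbols and the per-cycle indicator
```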
18
Constant-Output-Rate PLA-Based Architecture (1)
The constant-output-rate PLA-based architecture for the VLC
decoder
19
Constant-Output-Rate PLA-Based Architecture (2)
  • Output rate: one codeword/cycle
  • A barrel shifter stores a 32-bit window of input data
  • The PLA takes 16 bits at its input to decode one codeword
  • The PLA output includes the actual length of the decoded
    codeword
  • The length is accumulated and used to mark the new starting
    position within the 32-bit window from which the next 16
    bits are taken for decoding
  • The carry bit of the accumulator issues a data-fetch command
    to get the next word from the input buffer (see the sketch
    below)
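A behavioral sketch of this datapath, under an assumed codebook, is shown below: the 32-bit window is refilled in 16-bit words, a stand-in lookup returns each codeword's symbol and length, the accumulated length locates the next starting position, and crossing the 16-bit mark plays the role of the accumulator carry that triggers a fetch.

```python
from collections import deque

CODEBOOK = {"a": "0", "b": "10", "c": "110", "d": "111"}   # assumed example

def pla_lookup(head16):
    """Stand-in for the PLA: decode one codeword from the head of the window."""
    for sym, code in CODEBOOK.items():
        if head16.startswith(code):
            return sym, len(code)            # symbol and its actual length
    raise ValueError("invalid prefix")

def constant_output_decode(words, n_codewords):
    """words: list of 16-bit strings from the input buffer; one codeword per cycle."""
    words = iter(words)
    window = deque(next(words) + next(words))    # 32-bit window of input data
    position, out = 0, []                        # accumulated starting position
    for _ in range(n_codewords):
        head = "".join(window)[position:position + 16]
        sym, length = pla_lookup(head)
        out.append(sym)
        position += length
        if position >= 16:                       # accumulator carry: fetch next word
            for _ in range(16):
                window.popleft()
            window.extend(next(words))
            position -= 16
    return out

stream = "010110111" * 6                          # assumed stream: 'abcd' repeated
words = [stream[i:i + 16] for i in range(0, len(stream), 16)]
print(constant_output_decode(words, 12))          # ['a', 'b', 'c', 'd'] repeated
```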

20
Variable I/O Rate PLA-Based Architecture
The variable I/O rate PLA-based architecture for the VLC
decoder
21
High-Level PLA Optimization Techniques: Overview
  • Logic minimization → lower PLA complexity
  • Performed by VLSI CAD systems (ESPRESSO)
  • PLA decomposition → shorter clock cycle time → higher
    throughput

Each product term stands for a one-to-one input-output
correspondence. Complexity is determined by the input width,
the output width, and the number of product terms. Higher PLA
complexity means a longer clock cycle time and a larger silicon
area.
22
High-Level PLA Optimization Techniques: Complexity Table
Comparison of complexity and throughput of PLA-based
architectures
23
PLA Decomposition (1)
  • PLA decomposition
  • Split a single PLA truth table into several subsets
  • Remove don't-care input bits or zero-valued output bits
  • Operate all decomposed PLAs in parallel
  • Add OR gates to preserve overall functional correctness
  • Reduces PLA cycle time
  • Adds OR-gate delay
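A toy software illustration of the decomposition idea, assuming each sub-PLA covers a disjoint subset of product terms and outputs zero for inputs it does not cover, so OR-ing the subblock outputs reproduces the original function:

```python
def pla(product_terms, inputs):
    """Evaluate a PLA: product terms are (input pattern with '-' don't-cares, output bits)."""
    out = 0
    for pattern, output in product_terms:
        if all(p in ("-", b) for p, b in zip(pattern, inputs)):
            out |= output
    return out

# Split one truth table into two subblocks and OR their outputs.
terms = [("00--", 0b0001), ("01--", 0b0010), ("10--", 0b0100), ("11--", 0b1000)]
sub_a, sub_b = terms[:2], terms[2:]
x = "1001"
assert pla(terms, x) == pla(sub_a, x) | pla(sub_b, x)
```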

24
PLA Decomposition (2)
PLA specification suitable for decomposition
Using OR gates to combine the outputs from
different decomposed PLA subblocks
25
Pipelined PLA-Based Architecture
  • Separate the PLA outputs into several stages

The pipelined constant-input-rate PLA-based architecture for
the VLC decoder. The first stage decodes only the next-state
output. The second stage decodes the symbols contained in the
current block of input.
26
Performance Analysis of PLA-Based Architectures
  • Net decoder information rate = (number of bits decoded per
    cycle) × (clock rate)
  • More bits or codewords per cycle → higher PLA complexity →
    lower clock rate
  • Chip size ∝ circuit complexity ∝ (2I + O) × P
  • I = number of input terms
  • O = number of output terms
  • P = number of product terms

27
PLA-Based Architecture Implementation and
Simulation Results
  • Original complexity
  • 17 inputs, 43 outputs, 65536 product terms
  • After logic optimization
  • The number of product terms is reduced to 4665
  • After pipelining
  • The first-stage PLA has 17 inputs, 8 outputs, and 1582
    product terms, and can be decomposed into 64 small PLAs
  • In the second stage, the whole PLA can be decomposed, and
    the cascaded OR gates can be pipelined into as many levels
    as necessary
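Plugging these reported figures into the complexity measure from the previous slide gives a rough sense of the gains (a relative area estimate, not an exact gate count):

```latex
\begin{aligned}
\text{original:}                 &\quad (2\cdot 17 + 43)\cdot 65536 = 77\cdot 65536 \approx 5.0\times 10^{6}\\
\text{after logic optimization:} &\quad 77\cdot 4665 \approx 3.6\times 10^{5}\\
\text{first pipeline stage:}     &\quad (2\cdot 17 + 8)\cdot 1582 = 42\cdot 1582 \approx 6.6\times 10^{4}
\end{aligned}
```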

28
Comparison of Tree-Based and PLA-Based
Architectures
  • Tree-based architectures
  • Regular structure
  • Short clock cycle
  • Partially programmable
  • Bit streams need to be independent
  • PLA-based architectures
  • Not programmable (PLA optimization techniques are content
    dependent)
  • More flexible
  • Input/output can be fixed or variable to meet real
    application requirements

29
Why Is Concurrent Decoding Needed?
  • Tree-based and PLA-based architectures are
  • Application dependent
  • Architecture dependent
  • Sometimes technology dependent
  • Helpful for specific architectures, but limited in general
  • So we need
  • A general method to improve the decoding speed of an
    arbitrary decoder architecture through concurrency

30
Concurrent Decoding for Multiple Sources
  • Individual color components
  • Subsampling or filtering to decompose an image
  • Concurrency level is the number of coded bit streams or the
    number of components
  • Disadvantages
  • The amount of concurrency is application dependent and
    insufficient
  • Individually coded components may have different channel
    rates, different peak coding rates, and different source
    symbol statistics, and thus different codebooks, so they can
    hardly share the same concurrent decoding hardware

31
Concurrent Decoding for a Single Source: Basic
Concurrent Techniques (1)
  • FSM
  • If the input rate is constant, the decoder is a synchronous
    FSM that does not always generate an output

32
Basic Concurrent Techniques (2)
  • To run an FSM K times faster, the table lookup takes K
    inputs in parallel per cycle
  • Table size is N × A^K (originally N × A), where N is the
    number of states and A is the input alphabet size (see the
    sketch below)
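A sketch of precomputing the K-step table for a hypothetical codebook: with N states and a binary alphabet (A = 2), the table grows from N·A to N·A^K entries, which is the price of running the FSM K times faster.

```python
from itertools import product

CODEBOOK = {"a": "0", "b": "10", "c": "110", "d": "111"}   # assumed example

def step(state, bit, codebook):
    """Single-bit FSM step: state is the pending prefix; return (outputs, next state)."""
    prefix = state + bit
    if prefix in codebook.values():
        return [next(s for s, c in codebook.items() if c == prefix)], ""
    return [], prefix

def k_step_table(codebook, k):
    """Table indexed by (state, k input bits): N * A**k entries, A = 2 here."""
    states = {c[:i] for c in codebook.values() for i in range(len(c))}
    table = {}
    for state in states:
        for bits in product("01", repeat=k):
            outputs, s = [], state
            for b in bits:
                out, s = step(s, b, codebook)
                outputs += out
            table[(state, "".join(bits))] = (outputs, s)
    return table

table = k_step_table(CODEBOOK, k=4)
print(len(table))                  # 3 states * 2**4 = 48 entries
print(table[("", "0101")])         # (['a', 'b'], '1') -- two symbols, pending prefix '1'
```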

33
Concurrent Decoding with the Bit-Position Method (1)
  • Based on the block postcomputation principle
  • Break the coded bit stream into segments
  • Decode all segments concurrently
  • Then process the dependency among codeword boundaries
  • The concurrency level is theoretically unlimited

34
Concurrent Decoding with the Bit-Position Method (2)
  • Example
  • The incoming bit stream is divided into blocks of M bits
  • The overlapping window has Lmax bits, the maximum codeword
    size
  • The codeword boundary falls within Lmax bits of the block
    boundary (see the sketch below)
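The idea can be sketched as below for a hypothetical codebook and an assumed block size M: every block is decoded speculatively from each of the Lmax possible entry offsets (the parallelizable step), and a light sequential pass then chains the true boundary offsets from block to block.

```python
CODEBOOK = {"a": "0", "b": "10", "c": "110", "d": "111"}   # assumed example
REVERSE = {code: sym for sym, code in CODEBOOK.items()}
LMAX = max(len(c) for c in CODEBOOK.values())               # overlap window size
M = 8                                                        # assumed block size

def decode_from(bits, start, limit):
    """Decode codewords from 'start' until the next codeword would begin at or
    beyond 'limit'; return (symbols, entry offset into the following block)."""
    symbols, pos = [], start
    while pos < limit:
        code = next((c for c in CODEBOOK.values() if bits.startswith(c, pos)), None)
        if code is None:                       # end of stream / impossible offset
            break
        symbols.append(REVERSE[code])
        pos += len(code)
    return symbols, pos - limit

def bit_position_decode(stream):
    # Each block carries an Lmax-bit overlap so straddling codewords can finish.
    blocks = [stream[i:i + M + LMAX] for i in range(0, len(stream), M)]
    # Parallelizable step: decode every block from every possible entry offset.
    speculative = [{o: decode_from(b, o, M) for o in range(LMAX)} for b in blocks]
    # Light sequential step: chain the true boundary offsets block by block.
    out, offset = [], 0
    for table in speculative:
        symbols, offset = table[offset]
        out += symbols
    return out

print(bit_position_decode("010110111" * 3))    # ['a', 'b', 'c', 'd'] three times
```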

35
Concurrent Decoding with Bit-Controlled Coding
  • The encoder finds the codeword for the current input symbol,
    buffers the codeword bits, and sends the bits out
    sequentially
  • Select a block length of M bits
  • A counter tracks the number of bits generated by the encoder
    and resets every M bits to signal the block boundaries
  • If the codeword is complete at that point, the codeword
    boundary is aligned with the block boundary
  • If not, the encoder repeats the buffered bits of the
    incomplete codeword in the next block and then continues to
    complete the codeword (see the sketch below)
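A sketch of this encoder-side behavior, assuming a hypothetical codebook and block length M: a codeword that cannot finish inside the current block is restarted at the top of the next block by repeating its buffered bits, so every block boundary is also a codeword boundary for the decoder.

```python
CODEBOOK = {"a": "0", "b": "10", "c": "110", "d": "111"}   # assumed example
M = 8                                                       # assumed block length

def controlled_encode(symbols):
    """Emit blocks of M bits; a codeword that cannot finish inside the current
    block is restarted (its buffered bits repeated) at the top of the next block."""
    blocks, current = [], ""
    for sym in symbols:
        code = CODEBOOK[sym]
        if len(current) + len(code) <= M:
            current += code                        # codeword completes in this block
        else:
            sent = M - len(current)                # bits of the codeword already sent
            blocks.append(current + code[:sent])   # pad out the block with them
            current = code                         # repeat them and finish the codeword
        if len(current) == M:
            blocks.append(current)
            current = ""
    if current:
        blocks.append(current.ljust(M, "0"))       # flush (zero padding is an assumption)
    return blocks

print(controlled_encode(list("abcdabcd")))
# ['01011011', '11101011', '11011100'] -- 'd' starts in block 1, restarts in block 2
```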

36
Summary
  • Two classes of high-speed VLC decoders
  • Tree-based architectures
  • PLA-based architectures
  • Achieved
  • Single-chip implementation
  • Over 200 Mb/s decoding speed
  • Concurrent decoding methods
  • Multiple-source concurrent decoding
  • Single-source concurrent decoding
  • Basic concurrent technique
  • Bit-position method
  • Controlled-coding method

37
Useful Links
  • http://castle.ee.nctu.edu.tw/mountain/paper.html
  • http://www-us6.semiconductors.com/pip/TDA8043H
  • http://www.dolby.com/tech/ac-3mult.html
  • http://www.infotech.tu-chemnitz.de/microtec/eng/press/annualrep98/mpeg.htm
  • http://www.ece.wisc.edu/hu/ece734/references/chang92a.pdf
  • http://www.ece.wisc.edu/hu/ece734/references/chang92b.pdf
  • http://ieeexplore.ieee.org/iel5/6126/16375/00755661.pdf