MPEG-compliant Entropy Decoding on FPGA-augmented TriMedia/CPU64

1
MPEG-compliant Entropy Decoding on FPGA-augmented
TriMedia/CPU64
  • Mihai Sima, Sorin Cotofana, Stamatis
    Vassiliadis
  • Jos van Eijndhoven, Kees Vissers
  • Delft University of Technology, Philips
    Research, TriMedia Technologies, Inc.
  • CE Colloquium
  • Computer Engineering Laboratory, TU Delft, Feb.
    07, 2002

2
Outline
  • Goals and assumptions
  • TriMedia/CPU64 architecture and software tools
  • Architectural extension for TriMedia/CPU64
  • MPEG theoretical background
  • Entropy decoder on standard/extended
    TriMedia/CPU64
  • Experimental framework
  • Experimental results
  • Conclusions

3
Goals and assumptions
  • Custom Computing Machine
  • (General-Purpose) Processor augmented with an
    FPGA core
  • Basic idea
  • (General-Purpose) Processor: medium performance
    for a large class of applications
  • FPGA: flexibility to implement
    application-specific computations
  • To assess the performance of a hybrid
    TriMedia/CPU64 + FPGA
  • TriMedia/CPU64
  • 5 issue-slot VLIW processor, media-processing
    oriented
  • FPGA: ACEX 1K family
  • Improvements within TriMedia media processing
    domain
  • Benchmark: MPEG-compliant entropy decoding

4
TriMedia/CPU64 architecture
  • 5 issue-slot VLIW processor
  • 64-bit datapath
  • Multimedia-oriented
  • Subword parallelism
  • Predicative operations
  • Single-slot operation
  • Super-operation
  • 2's complement, C-style wrap-around arithmetic
  • clipping, rounding
  • vector shuffle
  • multiply-and-sum
  • sum-of-absolute-differences
  • look-up
  • Software tools: compiler, scheduler, assembler,
    simulator
  • treating a 64-bit word as a vector of 8-, 16-, or
    32-bit elements
  • FPGA-augmented TriMedia/CPU64
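
The subword-parallelism bullet (a 64-bit word treated as a vector of 8-, 16-, or 32-bit elements) can be illustrated with a small software model. This is a sketch in plain Python, not TriMedia code: the helper names `lanes`, `pack`, `add_wrap`, and `sad` are made up for illustration and are not actual TriMedia/CPU64 operations.

```python
MASK64 = (1 << 64) - 1


def lanes(word, width):
    """Split a 64-bit word into 64 // width unsigned lanes, least significant first."""
    mask = (1 << width) - 1
    return [(word >> shift) & mask for shift in range(0, 64, width)]


def pack(values, width):
    """Pack lane values (least significant first) back into one 64-bit word."""
    mask = (1 << width) - 1
    word = 0
    for i, value in enumerate(values):
        word |= (value & mask) << (i * width)
    return word & MASK64


def add_wrap(a, b, width):
    """Lane-wise addition with wrap-around: no carry crosses a lane boundary."""
    return pack([x + y for x, y in zip(lanes(a, width), lanes(b, width))], width)


def sad(a, b, width=8):
    """Sum of absolute differences over all lanes of two packed words."""
    return sum(abs(x - y) for x, y in zip(lanes(a, width), lanes(b, width)))
```

With 8-bit lanes, `add_wrap(0x00FF, 0x0001, 8)` wraps the lowest lane to zero without touching its neighbour, whereas the same inputs with 16-bit lanes give `0x0100`.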

5
FPGA-augmented TriMedia/CPU64
  • Reconfigurable Functional Unit
  • receives instructions from instruction decoder
  • inputs from register file
  • outputs to register file
  • New instructions
  • SET
  • ACTIVATE
  • EXECUTE
  • User responsibility
  • appropriate EXECUTE instruction
  • semantics of the operation
  • latency of the operation
  • the issue slots

6
MPEG background
  • DCT + Q: force as many values as possible to
    zero
  • Zig-zag: 8x8 coefficient matrix → 64-coefficient
    vector
  • RLC: consecutive zeros represented by
    their run-length
  • → the number of samples is reduced
  • VLC: shorter codewords to frequently
    occurring symbols
  • → the average bit rate is reduced
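
The zig-zag and RLC steps listed above can be sketched in a few lines. This is a simplified illustration, not the MPEG reference procedure: it ignores the separate handling of intra DC coefficients, and the function names are chosen here for illustration only.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in zig-zag scan order."""
    def key(rc):
        r, c = rc
        d = r + c                       # index of the anti-diagonal
        return (d, r if d % 2 else -r)  # odd diagonals run downwards, even ones upwards
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def zigzag_scan(block):
    """Turn a square coefficient matrix into a 1-D coefficient vector."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def run_level_encode(vector):
    """Replace each non-zero value by a (run of preceding zeros, level) pair.

    Trailing zeros produce no pair; they are implied by the end-of-block symbol.
    """
    pairs, run = [], 0
    for value in vector:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs


block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 12, 5, -3
pairs = run_level_encode(zigzag_scan(block))  # three pairs instead of 64 samples
```

Here the 64 samples of a sparse block shrink to three run-level pairs, which is the reduction in sample count the RLC bullet refers to.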

VLD is a sequential algorithm
Decoding: the inverse operation of coding
7
Variable-length decoding
Entropy decoding
Run-length decoding
  • Code-length of the symbol is variable
  • Both input and output rate of a VLD cannot be
    kept constant
  • Three different variable-length decoders
  • constant-input-rate VLD: decodes a fixed number
    of bits and produces a variable number of symbols
    per unit time
  • constant-output-rate VLD: decodes one symbol per
    cycle regardless of its length
  • variable-input-output-rate VLD: mixture of the
    first two
  • First, the number of zeros specified by run value
    is issued
  • Then, the level is passed through
  • Optimization for programmable-processor platform
  • an empty vector is filled in with level values at
    positions defined by run values
  • 2^17 = 128 Kwords = 512 KB for direct mapping of
    all possible codewords
  • 2^34 = 16 Gwords = 64 GB for direct mapping of all
    possible two-codeword combinations
  • Such a large memory is impractical for the time
    being
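
The run-length decoding optimization mentioned above (fill an empty vector with level values at the positions defined by the run values, instead of emitting zeros one by one) can be sketched as follows. The function name and the list-of-pairs input format are assumptions made for this illustration.

```python
def run_length_decode(pairs, size=64):
    """Rebuild a coefficient vector from (run, level) pairs.

    The vector starts out zeroed, so only the non-zero levels are written.
    """
    vector = [0] * size
    pos = 0
    for run, level in pairs:
        pos += run                      # skip over the zeros
        if pos >= size:
            raise ValueError("run-level pairs overflow the block")
        vector[pos] = level             # one store per symbol
        pos += 1
    return vector
```

For example, `run_length_decode([(0, 12), (0, 5), (1, -3)])` writes three values and leaves the other 61 positions at zero.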

Both are sequential algorithms
8
Entropy decoding in software
  • Reference implementation: Philips
  • Improvement 19
  • VL decoding strategy: repeated table look-up
  • Each look-up decodes a variable chunk of bits
  • Hit → run-level pair → RLD by filling in the 8x8
    matrix
  • Miss → offset and chunk size for the next look-up
  • Up to three look-ups are needed to decode a
    symbol
  • A single look-up takes minimum 13 cycles
  • 13-39 cycles / symbol
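
The hit/miss look-up strategy above can be modelled with chained tables. The sketch below uses a made-up eight-codeword prefix code, not the MPEG tables, and a first chunk of 3 bits chosen arbitrarily; with this toy code at most two look-ups are needed, whereas the slide reports up to three for the real tables.

```python
# Toy prefix code (hypothetical, NOT the MPEG VLC tables).
CODES = {
    "10": (0, 1), "11": "EOB", "010": (1, 1), "011": (0, 2),
    "0010": (2, 1), "0011": (0, 3), "00010": (3, 1), "00011": (1, 2),
}


def build_table(codes, chunk):
    """Table indexed by `chunk` bits; entries are hits or misses (or None if invalid)."""
    table = [None] * (1 << chunk)
    longer = {}
    for bits, symbol in codes.items():
        if len(bits) <= chunk:
            pad = chunk - len(bits)
            base = int(bits, 2) << pad
            for fill in range(1 << pad):          # replicate over the don't-care bits
                table[base + fill] = ("hit", symbol, len(bits))
        else:
            longer.setdefault(bits[:chunk], {})[bits[chunk:]] = symbol
    for prefix, rest in longer.items():
        sub_chunk = min(chunk, max(len(b) for b in rest))
        # Miss: points at the next table and gives the chunk size for the next look-up.
        table[int(prefix, 2)] = ("miss", build_table(rest, sub_chunk), sub_chunk)
    return table


def decode_symbol(bits, pos, table, chunk):
    """Return (symbol, new position, number of look-ups) for the codeword at pos."""
    lookups = 0
    while True:
        window = bits[pos:pos + chunk].ljust(chunk, "0")
        entry = table[int(window, 2)]
        lookups += 1
        if entry is None:
            raise ValueError("invalid codeword")
        kind, payload, size = entry
        if kind == "hit":
            return payload, pos + size, lookups
        pos += chunk                              # the whole chunk was consumed
        table, chunk = payload, size


def decode_block(bits, table, chunk):
    """Decode run-level pairs until end-of-block; also count the look-ups."""
    pos, symbols, total_lookups = 0, [], 0
    while True:
        symbol, pos, lookups = decode_symbol(bits, pos, table, chunk)
        total_lookups += lookups
        if symbol == "EOB":
            return symbols, total_lookups
        symbols.append(symbol)
```

Multiplying the look-up count per symbol by the 13-cycle minimum quoted on the slide gives the per-symbol cost range for this kind of decoder.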

Idea !
9
VLD on TriMedia/CPU64 RFU
  • What functionality to embed into FPGA ?
  • VLD must be balanced against the goal of making
    the whole entropy decoder fast.
  • No tri-state logic in FPGA
  • Barrel-shifting can be implemented by cascaded
    multiplexers selecting fixed-size shifts by
    1, 2, 4, ...
  • Barrel-shifting is expensive in FPGA
  • Latency
  • 5628 bits (21 codewords) 10.2 ns
  • 8428 bits (31 codewords) 11.9 ns
  • 11256 bits (42 codewords) 15.0 ns
  • 3 TriMedia cycles
  • Barrel-shifting is cheap in standard TriMedia
  • 1 TriMedia cycle
  • RFU calls must have fixed latency
  • RFU latency
  • 1 cycle to read the arguments from register file
  • delay on FPGA
  • 1 cycle to write back the results to register
    file
  • Maximum 4 input 64-bit arguments
  • Maximum 2 output 64-bit results
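
The cascaded-multiplexer barrel shifter mentioned above can be modelled in software as a logarithmic shifter: stage k either passes its input through or shifts it by 2^k, selected by bit k of the shift amount. The sketch below is a behavioural model only (the function name and the 64-bit default width are assumptions); it says nothing about FPGA timing.

```python
def barrel_shift_left(word, amount, width=64):
    """Left shift built from log2(width) stages of fixed shifts by 1, 2, 4, ..."""
    if not 0 <= amount < width:
        raise ValueError("shift amount out of range")
    mask = (1 << width) - 1
    stages = (width - 1).bit_length()        # 6 stages for a 64-bit word
    for k in range(stages):
        if (amount >> k) & 1:                # select input of the stage-k multiplexer
            word = (word << (1 << k)) & mask
    return word
```

Each stage adds one multiplexer level to the signal path, which is consistent with the slide's remark that barrel-shifting is expensive in the FPGA.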

10
VLD-1 on FPGA
  • Design parameters
  • Run/Level pair or End-of-Block per execution
  • Fixed latency
  • How to fulfill that?
  • First idea: 17-input look-up table

TOO LARGE !
Latency: 6-7 TriMedia cycles
  • Second idea: partition the VL codes into groups
    in order to allow for smaller look-up tables (LUT)

11
VLD-2 on FPGA
  • Two symbols per execution: how to do that?
  • Truly two codewords at a time: 64 GB look-up
    table

Latency: 7-8 TriMedia cycles
  • Early work (I)
  • run, level, code-length for 1st codeword
  • barrel-shift
  • run, level, code-length for 2nd codeword
  • Early work (II)
  • run, level, code-length for 1st codeword
  • run, level, code-length for all possible 2nd
    codewords
  • only a selection is carried out of the proper
    codeword
  • VLD-2: new idea!
  • run, level, code-length for 1st codeword
  • code-length for all possible 2nd codewords
  • only a selection is carried out of the proper
    code-length
  • the computation of the run and level for the 2nd
    codeword is postponed for the next VLD call
  • with the exception of a firing-up call, truly 2
    codewords per call is achieved for all subsequent
    calls
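
The postponement idea can be illustrated with a toy software model. This is only one possible reading of the scheme, under loud assumptions: the code table is made up (not MPEG), and the interface `vld2_call(bits, pos, len_p)` is hypothetical, since the slides do not give the actual RFU argument layout. Each call decodes run/level for the previous codeword (whose length was found by the call before) and for the current codeword, and determines only the length of the next one.

```python
# Toy prefix code (hypothetical, NOT the MPEG VLC tables).
CODE = {"11": "EOB", "10": (0, 1), "010": (1, 1), "011": (0, 2),
        "0010": (2, 1), "0011": (0, 3)}
MAX_LEN = 4


def match(bits, pos):
    """Return (symbol, length) of the codeword that starts at pos."""
    for length in range(1, MAX_LEN + 1):
        symbol = CODE.get(bits[pos:pos + length])
        if symbol is not None:
            return symbol, length
    raise ValueError("invalid codeword")


def vld2_call(bits, pos, len_p):
    """Model of one call: (previous symbol, current symbol, len_c, len_n)."""
    prev = CODE[bits[pos:pos + len_p]] if len_p else None   # postponed from last call
    if prev == "EOB":
        return prev, None, 0, 0
    cur, len_c = match(bits, pos + len_p)
    if cur == "EOB":
        return prev, cur, len_c, 0
    _, len_n = match(bits, pos + len_p + len_c)              # length only, no run/level
    return prev, cur, len_c, len_n


def decode_block(bits):
    """Return the decoded run-level pairs and the number of calls used."""
    pos, len_p, symbols, calls = 0, 0, [], 0
    while True:
        prev, cur, len_c, len_n = vld2_call(bits, pos, len_p)
        calls += 1
        if prev not in (None, "EOB"):
            symbols.append(prev)
        if prev == "EOB" or cur == "EOB":
            return symbols, calls
        symbols.append(cur)
        pos += len_p + len_c        # align past the previous and current codewords
        len_p = len_n               # the next codeword becomes the previous one


symbols, calls = decode_block("10" + "010" + "0011" + "0010" + "11")
```

In this model the first (firing-up) call yields one symbol and every later call yields two, so four run-level pairs plus the end-of-block symbol take three calls.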

12
VLD-x on FPGA (x ≥ 3)
  • VLD-2 principle is scalable and can be extended
    to VLD-x (x ≥ 3)
  • VLD-3: two next / previous codewords are
    considered
  • Unfortunately, VLD-x (x ≥ 3) seems not to be
    feasible
  • the computation of the code-lengths for two next
    codewords is on the critical path
  • e.g., 12 TriMedia cycles are needed only to
    decode the code-lengths for current, and two next
    codewords
  • while the selection of the code-length of the
    next codeword in VLD-2 can be completed in about
    the same time with run-level decoding for the
    current and previous codewords
  • VLD-2 latency ≈ VLD-1 latency
  • Constraints related to TriMedia/CPU64
    super-operation format
  • too many values have to be returned by the VLD call
  • packing difficulties

13
Entropy decoding on extended TriMedia
  • VLD-1-based entropy decoder
  • 1. Initializations
  • 2. VLD-1 call
  • 3. Field extraction
  • run, level, code-length, exit flag
  • 4. Updating accumulated code-length
  • 5. Exit if exit condition
  • 6. Run-length decoding
  • 7. Aligning the input string
  • 8. Go to 2.
  • Stage 6 can be folded into the loop
  • (software pipelining)
  • VLD-2-based entropy decoder
  • 1. Initializations
  • 2. VLD-2 call
  • 3. Field extraction: run_p, level_p, run_c,
    level_c, code-length_c, code-length_n, exit flag
  • 4. Updating accumulated code-length
  • 5. Aligning the input string for previous
    codeword
  • 6. Updating accumulated code-length
  • 7. Exit if exit condition
  • 8. Run-length decoding for previous codeword
  • 9. Run-length decoding for current codeword
  • 10. Aligning the input string for current, next
    codeword
  • 11. Go to 2.
  • Stage 9 or both Stages 8 and 9
  • can be folded into the loop (software pipelining)
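
The eight stages of the VLD-1-based decoder listed above can be walked through in a small software model. Everything specific here is an assumption for illustration: the toy code table, the 8-bit window, and the packed result layout (level in bits 0-5, run in bits 6-11, code-length in bits 12-15, exit flag in bit 16) are invented, and `vld1` merely stands in for the RFU call.

```python
# Toy prefix code (hypothetical, NOT the MPEG tables); None marks end-of-block.
CODE = {"11": None, "10": (0, 1), "010": (1, 1), "011": (0, 2),
        "0010": (2, 1), "0011": (0, 3)}
WIN = 8  # window width in bits (assumed)


def vld1(window):
    """Stand-in for the VLD-1 call: decode the codeword at the top of the window."""
    text = format(window, "08b")
    for length in range(2, 5):
        if text[:length] in CODE:
            symbol = CODE[text[:length]]
            run, level = symbol if symbol else (0, 0)
            exit_flag = 1 if symbol is None else 0
            return (exit_flag << 16) | (length << 12) | (run << 6) | level
    raise ValueError("invalid codeword")


def entropy_decode(bits):
    stream = int(bits, 2) << WIN                 # 1. initializations (zero padding)
    total = len(bits) + WIN
    consumed, pos, block = 0, 0, [0] * 64
    while True:                                  # 8. go to 2
        window = (stream >> (total - consumed - WIN)) & 0xFF   # 7. aligned input
        packed = vld1(window)                    # 2. VLD-1 call
        level = packed & 0x3F                    # 3. field extraction
        run = (packed >> 6) & 0x3F
        length = (packed >> 12) & 0xF
        exit_flag = packed >> 16
        consumed += length                       # 4. accumulated code-length
        if exit_flag:                            # 5. exit if exit condition
            return block, consumed
        pos += run                               # 6. run-length decoding
        block[pos] = level
        pos += 1
```

The model runs the stages strictly in order; it does not attempt the software pipelining mentioned on the slide.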

WATCH OUT ! Due to the higher complexity, the
overhead associated with loop firing-up may
become significant !
14
Entropy decoding computation overview
On standard TriMedia/CPU64
  • Hit / Miss mechanism
  • Up to 3 look-ups are needed to decode a symbol
  • A single look-up takes minimum 13 cycles
  • 13-39 cycles / iteration

On FPGA-augmented TriMedia/CPU64
VLD-1-based entropy decoding
  • VLD on FPGA (latency 6-7 cycles)
  • One symbol per call
  • 11 cycles / iteration
VLD-2-based entropy decoding
  • VLD on FPGA (latency 7-8 cycles)
  • Two symbols per call
  • 17 cycles / iteration
  • Double software pipeline
  • at the VLD level
  • at the entropy decoder level

15
Experimental framework
  • Testing database: preprocessed MPEG-conformance
    strings, from which all data not representing DCT
    coefficients has been removed → only run-level
    and end-of-block symbols
  • MPEG strings are entirely resident in main
    memory → side effects like asynchronous
    interrupts, thrashing routines, other operating
    system related tasks do not have to be counted
  • Zeroing the reconstructed 8x8 matrices is not
    counted either → the run-length decoder overwrites
    the same 8x8 matrices again and again.
  • The only relevant metric: number of
    instruction cycles needed to perform strictly
    entropy decoding
  • Two experiment classes
  • entropy decoding loop is left on End-of-Block
  • entropy decoding loop is left on
    End-of-Macro-Block
  • Two strategies
  • VLD returns run
  • VLD returns non-zero-coefficient-position
    relative to block/macro-block

16
Experimental results
17
Remarks
  • Pure-software solution: 16.0-17.0 cycles /
    symbol
  • VLD-2 based: 8.0-8.5 cycles / symbol
  • Only 2.5 slots of 5 are filled in
  • VLD-2 latency: 8 cycles
  • Barrel-shifting latency: 1 cycle
  • Extract runs, levels, code-lengths: 1 cycle
  • If end-of-block, then exit the loop: 3 cycles
  • Store the level values: 3 cycles
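
The latencies listed above can be added up as a quick sanity check. The sketch assumes that the five stages run strictly one after another within an iteration and that each iteration yields two symbols; the slides list the figures but do not state this serial model explicitly.

```python
# Per-iteration latencies in TriMedia cycles, as listed on the slide.
STAGE_CYCLES = {
    "VLD-2 call (RFU latency)": 8,
    "barrel shifting": 1,
    "field extraction": 1,
    "end-of-block test and exit": 3,
    "storing the level values": 3,
}
SYMBOLS_PER_ITERATION = 2  # VLD-2 returns two symbols per call


def cycle_budget(stages=STAGE_CYCLES, symbols=SYMBOLS_PER_ITERATION):
    """Return (cycles per iteration, cycles per symbol, share spent in the VLD call)."""
    total = sum(stages.values())
    return total, total / symbols, stages["VLD-2 call (RFU latency)"] / total


total, per_symbol, vld_share = cycle_budget()
```

Under these assumptions the sum is 16 cycles per iteration, or 8.0 cycles per symbol, which matches the lower end of the 8.0-8.5 range quoted for the VLD-2-based decoder. The VLD call alone accounts for half of the iteration.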

FPGA-based VLD latency is the bottleneck !
18
Conclusions
2×
  • Entropy decoding
  • Hardware penalty: one EP1K100 FPGA (100,000
    gates)
  • Bottleneck: the latency of FPGA-based VLD
  • Can we do better ?
  • lots of open questions ...
  • Future work
  • Testing with HDTV strings
  • Motion compensation
  • YUV to RGB converter
  • Full MPEG decoder