Title: MPEG-compliant Entropy Decoding on FPGA-augmented TriMedia/CPU64
1 MPEG-compliant Entropy Decoding on FPGA-augmented TriMedia/CPU64
- Mihai Sima, Sorin Cotofana, Stamatis Vassiliadis
- Jos van Eijndhoven, Kees Vissers
- Delft University of Technology; Philips Research; TriMedia Technologies, Inc.
- CE Colloquium
- Computer Engineering Laboratory, TU Delft, Feb. 07, 2002
2 Outline
- Goals and assumptions
- TriMedia/CPU64 architecture and software tools
- Architectural extension for TriMedia/CPU64
- MPEG theoretical background
- Entropy decoder on standard/extended TriMedia/CPU64
- Experimental framework
- Experimental results
- Conclusions
3 Goals and assumptions
- Custom Computing Machine
  - (General-purpose) processor augmented with an FPGA core
- Basic idea
  - (General-purpose) processor: medium performance for a large class of applications
  - FPGA: flexibility to implement application-specific computations
- To assess the performance of a hybrid TriMedia/CPU64 + FPGA
  - TriMedia/CPU64: 5-issue-slot VLIW processor, media-processing oriented
  - FPGA: ACEX 1K family
- Improvements within the TriMedia media-processing domain
- Benchmark: MPEG-compliant entropy decoding
4 TriMedia/CPU64 architecture
- 5-issue-slot VLIW processor
- 2's complement, C-style wrap-around arithmetic
- clipping, rounding
- vector shuffle
- multiply-and-sum
- sum-of-absolute-differences
- look-up
- Software tools: compiler, scheduler, assembler, simulator
- treating a 64-bit word as a vector of 8-, 16-, or 32-bit elements
- FPGA-augmented TriMedia/CPU64
5 FPGA-augmented TriMedia/CPU64
- Reconfigurable Functional Unit (RFU)
  - receives instructions from the instruction decoder
  - inputs from the register file
  - outputs to the register file
- New instructions
  - SET
  - ACTIVATE
  - EXECUTE
- User responsibility
  - appropriate EXECUTE instruction
  - semantics of the operation
  - latency of the operation
  - the issue slots
6 MPEG background
- DCT + Q: force as many values as possible to zero
- Zig-zag: 8×8 coefficient matrix → 64-coefficient vector
- RLC: consecutive zeros are represented by their run-length
  - → the number of samples is reduced
- VLC: shorter codewords are assigned to frequently occurring symbols
  - → the average bit rate is reduced
VLD is a sequential algorithm
Decoding: the inverse operation of coding
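The zig-zag and RLC steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the sample block are assumptions, and MPEG-specific details (for example, separate coding of the intra DC coefficient) are ignored.

```python
def zigzag_order(n=8):
    """Return the (row, col) pairs of an n x n block in zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):            # s = row + col selects one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run from bottom-left to top-right, odd ones the other way.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def run_level_encode(vector):
    """Run-length code a coefficient vector as (run, level) pairs plus 'EOB'."""
    symbols, run = [], 0
    for coeff in vector:
        if coeff == 0:
            run += 1                      # count consecutive zeros
        else:
            symbols.append((run, coeff))  # zeros seen so far, then the level
            run = 0
    symbols.append("EOB")                 # trailing zeros are implied by end-of-block
    return symbols


# Hypothetical quantized block with three non-zero coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 12, -3, 5
vector = [block[r][c] for r, c in zigzag_order()]
symbols = run_level_encode(vector)
```

For this block the 64 samples collapse to three run-level pairs and one end-of-block symbol, which is the sample reduction the RLC bullet refers to.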
7 Entropy decoding
Variable-length decoding
- Code-length of the symbol is variable
- The input and output rates of a VLD cannot both be kept constant
- Three different variable-length decoders
  - constant-input-rate VLD: decodes a fixed number of bits and produces a variable number of symbols per unit time
  - constant-output-rate VLD: decodes one symbol per cycle regardless of its length
  - variable-input-output-rate VLD: a mixture of the first two
- 2^17 = 128 Kwords = 512 KB for direct mapping of all possible codewords
- 2^34 = 16 Gwords = 64 GB for direct mapping of all possible two-codeword combinations
- Such a large memory is impractical for the time being
Run-length decoding
- First, the number of zeros specified by the run value is issued
- Then, the level is passed through
- Optimization for programmable-processor platform
  - an empty vector is filled in with level values at positions defined by run values
Both are sequential algorithms
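The two run-length decoding variants on this slide can be compared side by side. The sketch below is an illustration under assumed names (`rld_stream`, `rld_fill`) and an assumed symbol format of `(run, level)` tuples terminated by `"EOB"`.

```python
def rld_stream(symbols, size=64):
    """Textbook RLD: issue `run` zeros, then pass the level through."""
    out = []
    for sym in symbols:
        if sym == "EOB":
            break
        run, level = sym
        out.extend([0] * run)
        out.append(level)
    out.extend([0] * (size - len(out)))   # pad the tail after end-of-block
    return out


def rld_fill(symbols, size=64):
    """Optimized RLD: start from a zeroed vector and store only the levels."""
    out = [0] * size
    pos = 0
    for sym in symbols:
        if sym == "EOB":
            break
        run, level = sym
        pos += run                        # skip the zeros instead of writing them
        out[pos] = level
        pos += 1
    return out


example = [(0, 12), (0, -3), (1, 5), "EOB"]
```

Both functions produce the same vector. The second performs one store per non-zero coefficient, which is why it suits a programmable processor whose output vector is already zeroed.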
8 Entropy decoding in software
- Reference implementation: Philips
- Improvement 19
- VL decoding strategy: repeated table look-up
- Each look-up decodes a variable chunk of bits
  - Hit → run-level pair → RLD by filling in the 8×8 matrix
  - Miss → offset and chunk size for the next look-up
- Up to three look-ups are needed to decode a symbol
- A single look-up takes a minimum of 13 cycles
- 13 to 39 cycles / symbol
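The hit/miss mechanism can be illustrated with a small model. Everything in it is an assumption made for illustration: the tables encode a hypothetical toy prefix code (not the MPEG tables), the input is a string of '0'/'1' characters, and the toy code needs at most two look-ups rather than three.

```python
HIT, MISS = "hit", "miss"

# Hypothetical toy tables; the list index is the value of the chunk just read.
#   (HIT, symbol, bits actually consumed from this chunk)
#   (MISS, next table, chunk size of the next look-up)
TABLES = {
    "T0": [(MISS, "T2", 3), (MISS, "T1", 2), (HIT, "EOB", 2), (HIT, (0, 1), 2)],
    "T1": [(HIT, (0, 2), 2), (HIT, (2, 1), 2), (HIT, (1, 1), 1), (HIT, (1, 1), 1)],
    "T2": [None, None, None, None, None,
           (HIT, (0, 3), 3), (HIT, (4, 1), 3), (HIT, (3, 1), 3)],
}


def peek(bits, pos, n):
    """Return the n bits starting at pos as an integer (zero-padded at the end)."""
    return int(bits[pos:pos + n].ljust(n, "0"), 2)


def decode_symbol(bits, pos):
    """Decode one symbol; return (symbol, new position, number of look-ups)."""
    table, chunk, lookups = "T0", 2, 0
    while True:
        entry = TABLES[table][peek(bits, pos, chunk)]
        lookups += 1
        if entry is None:
            raise ValueError("invalid codeword")
        kind, a, b = entry
        if kind == HIT:
            return a, pos + b, lookups
        pos += chunk                      # miss: the whole chunk belongs to the codeword
        table, chunk = a, b               # table and chunk size for the next look-up


def decode_block(bits):
    """Decode symbols up to and including end-of-block; count the look-ups."""
    pos, symbols, total = 0, [], 0
    while True:
        sym, pos, n = decode_symbol(bits, pos)
        symbols.append(sym)
        total += n
        if sym == "EOB":
            return symbols, total
```

Short codewords hit on the first look-up, while longer ones pay for one extra look-up each, which mirrors how the per-symbol cost on the slide scales with the number of look-ups.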
Idea !
9 VLD on TriMedia/CPU64 RFU
- What functionality to embed into the FPGA?
  - VLD must be balanced against the goal of making the whole entropy decoder fast.
- No tri-state logic in FPGA
  - Barrel-shifting can be implemented by cascaded multiplexers selecting fixed-size shifts by 1, 2, 4, ...
- Barrel-shifting is expensive in FPGA
  - Latency
    - 56 → 28 bits (2 → 1 codewords): 10.2 ns
    - 84 → 28 bits (3 → 1 codewords): 11.9 ns
    - 112 → 56 bits (4 → 2 codewords): 15.0 ns
  - 3 TriMedia cycles
- Barrel-shifting is cheap in standard TriMedia
  - 1 TriMedia cycle
- RFU calls must have fixed latency
- RFU latency
  - 1 cycle to read the arguments from the register file
  - delay on the FPGA
  - 1 cycle to write back the results to the register file
- Maximum 4 input 64-bit arguments
- Maximum 2 output 64-bit results
10 VLD-1 on FPGA
- Design parameters
  - Run/Level pair or End-of-Block per execution
  - Fixed latency
- How to fulfill that?
  - First idea: 17-input look-up table. TOO LARGE!
  - Second idea: partition the VL codes into groups in order to allow for smaller look-up tables (LUT)
Latency: 6 to 7 TriMedia cycles
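One way to picture the second idea is to group codewords by their number of leading zeros and give each group its own small LUT. The slides do not state how the groups are chosen, so the leading-zero criterion, the toy code table, and all names below are assumptions made for illustration.

```python
# Hypothetical toy prefix code (not the MPEG table): codeword -> symbol.
CODES = {
    "10": "EOB", "11": (0, 1), "011": (1, 1), "0100": (0, 2), "0101": (2, 1),
    "00101": (0, 3), "00110": (4, 1), "00111": (3, 1),
}


def build_group_luts(codes):
    """Partition codewords by leading-zero count; build one small LUT per group."""
    groups = {}
    for cw, sym in codes.items():
        zeros = len(cw) - len(cw.lstrip("0"))
        # Keep only the bits after the leading zeros and the first '1'.
        groups.setdefault(zeros, []).append((cw[zeros + 1:], sym, len(cw)))
    luts = {}
    for zeros, members in groups.items():
        idx_bits = max(len(suffix) for suffix, _, _ in members)
        lut = [None] * (1 << idx_bits)
        for suffix, sym, length in members:
            free = idx_bits - len(suffix)         # don't-care bits after a short suffix
            base = int(suffix or "0", 2) << free
            for fill in range(1 << free):
                lut[base + fill] = (sym, length)
        luts[zeros] = (idx_bits, lut)
    return luts


def vld1(bits, luts):
    """Decode the symbol at the head of `bits`; return (symbol, code length)."""
    zeros = len(bits) - len(bits.lstrip("0"))     # a priority encoder in hardware
    idx_bits, lut = luts[zeros]
    start = zeros + 1
    index = int(bits[start:start + idx_bits].ljust(idx_bits, "0"), 2)
    return lut[index]


LUTS = build_group_luts(CODES)
```

For this toy code the three group LUTs hold 10 entries in total, whereas a single directly mapped table indexed by the longest codeword (5 bits) would need 32.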
11 VLD-2 on FPGA
- Two symbols per execution: how to do that?
- Truly two codewords at a time: 64 GB look-up table
- Early work (I)
  - run, level, code-length for 1st codeword
  - barrel-shift
  - run, level, code-length for 2nd codeword
- Early work (II)
  - run, level, code-length for 1st codeword
  - run, level, code-length for all possible 2nd codewords
  - only a selection of the proper codeword is carried out
- VLD-2: new idea!
  - run, level, code-length for 1st codeword
  - code-length for all possible 2nd codewords
  - only a selection of the proper code-length is carried out
  - the computation of the run and level for the 2nd codeword is postponed to the next VLD call
  - with the exception of a firing-up call, truly 2 codewords per call is achieved for all subsequent calls
Latency: 7 to 8 TriMedia cycles
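The postponement idea can be modelled in software as follows. This is one possible reading of the slide, not the actual RFU design: the toy code table, the way the postponed codeword is handed to the next call, and all names are assumptions.

```python
# Hypothetical toy prefix code (not the MPEG table): codeword -> symbol.
CODES = {"10": "EOB", "11": (0, 1), "011": (1, 1), "0100": (0, 2), "0101": (2, 1)}
MAX_LEN = max(len(cw) for cw in CODES)


def match(bits, pos):
    """Return the toy codeword that starts at bits[pos:], or None."""
    for cw in CODES:
        if bits.startswith(cw, pos):
            return cw
    return None


def vld2_call(bits, pos, postponed):
    """Model of one VLD-2 call.

    `postponed` is the codeword whose run/level was deferred by the previous
    call (None on the firing-up call). Returns
    (symbol_p, symbol_c, length_c, length_n, newly postponed codeword).
    """
    symbol_p = CODES[postponed] if postponed is not None else None
    current = match(bits, pos)
    symbol_c = CODES[current] if current is not None else None
    length_c = len(current) if current is not None else 0
    # Code-length candidates for every possible start of the second codeword ...
    candidates = {off: match(bits, pos + off) for off in range(1, MAX_LEN + 1)}
    # ... of which one is selected by the code-length of the first codeword.
    nxt = candidates.get(length_c)
    length_n = len(nxt) if nxt is not None else 0
    return symbol_p, symbol_c, length_c, length_n, nxt


def decode_block(bits):
    """Decode up to end-of-block; return (symbols, number of VLD-2 calls)."""
    pos, postponed, symbols, calls = 0, None, [], 0
    while True:
        sym_p, sym_c, len_c, len_n, postponed = vld2_call(bits, pos, postponed)
        calls += 1
        for sym in (sym_p, sym_c):
            if sym is not None:
                symbols.append(sym)
            if sym == "EOB":
                return symbols, calls
        pos += len_c + len_n              # the input advances by two codewords per call
```

In this model the firing-up call yields one symbol and every later call yields two, so five symbols take three calls.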
12 VLD-x on FPGA (x ≥ 3)
- VLD-2 principle is scalable and can be extended to VLD-x (x ≥ 3)
  - VLD-3: two next / previous codewords are considered
- Unfortunately, VLD-x (x ≥ 3) seems not to be feasible
  - the computation of the code-lengths for the two next codewords is on the critical path
  - e.g., 12 TriMedia cycles are needed only to decode the code-lengths for the current and the two next codewords
  - while the selection of the code-length of the next codeword in VLD-2 can be completed in about the same time as run-level decoding for the current and previous codewords
  - VLD-2 latency ≈ VLD-1 latency
- Constraints related to the TriMedia/CPU64 super-operation format
  - too many values have to be returned by a VLD call
  - packing difficulties
13 Entropy decoding on extended TriMedia
- VLD-1-based entropy decoder
- 1. Initializations
- 2. VLD-1 call
- 3. Field extraction: run, level, code-length, exit flag
- 4. Updating accumulated code-length
- 5. Exit if exit condition
- 6. Run-length decoding
- 7. Aligning the input string
- 8. Go to 2.
- Stage 6 can be folded into the loop (software pipelining)
- VLD-2-based entropy decoder
- 1. Initializations
- 2. VLD-2 call
- 3. Field extraction: run_p, level_p, run_c, level_c, code-length_c, code-length_n, exit flag
- 4. Updating accumulated code-length
- 5. Aligning the input string for previous codeword
- 6. Updating accumulated code-length
- 7. Exit if exit condition
- 8. Run-length decoding for previous codeword
- 9. Run-length decoding for current codeword
- 10. Aligning the input string for current, next codeword
- 11. Go to 2.
- Stage 9 or both Stages 8 and 9 can be folded into the loop (software pipelining)
WATCH OUT! Due to the higher complexity, the overhead associated with loop firing-up may become significant!
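The eight stages of the VLD-1-based decoder map onto a short loop. The sketch below follows the stage numbering of the slide, but the packed field layout, the toy code table, and the software stand-in for the RFU call are all assumptions. Software pipelining is not modelled.

```python
# Hypothetical toy prefix code (not the MPEG table); None marks end-of-block.
CODES = {"10": None, "11": (0, 1), "011": (1, 1), "0100": (0, 2), "0101": (2, 1)}


def vld1(window):
    """Stand-in for the RFU call: decode the codeword at the head of `window`.

    Returns one packed word with an assumed layout:
    bit 16 = exit flag, bits 15..12 = code-length, bits 11..6 = run, bits 5..0 = level.
    """
    for cw, sym in CODES.items():
        if window.startswith(cw):
            run, level = sym if sym is not None else (0, 0)
            exit_flag = 1 if sym is None else 0
            return (exit_flag << 16) | (len(cw) << 12) | (run << 6) | level
    raise ValueError("invalid codeword")


def entropy_decode_block(bits):
    """Return (64-entry coefficient vector, number of bits consumed)."""
    coeffs, pos = [0] * 64, 0             # 1. initializations
    consumed, window = 0, bits
    while True:
        word = vld1(window)               # 2. VLD-1 call
        level = word & 0x3F               # 3. field extraction
        run = (word >> 6) & 0x3F
        length = (word >> 12) & 0xF
        exit_flag = (word >> 16) & 1
        consumed += length                # 4. update accumulated code-length
        if exit_flag:                     # 5. exit on end-of-block
            return coeffs, consumed
        pos += run                        # 6. run-length decoding
        coeffs[pos] = level
        pos += 1
        window = window[length:]          # 7. align the input string
        # 8. go to 2
```

The VLD-2-based loop has the same shape, with two run-length decoding stages and two alignment steps per iteration.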
14 Entropy decoding computation overview
On standard TriMedia/CPU64
- Hit / Miss mechanism
- Up to 3 look-ups are needed to decode a symbol
- A single look-up takes a minimum of 13 cycles
- 13 to 39 cycles / iteration
On FPGA-augmented TriMedia/CPU64
VLD-1-based entropy decoding
- VLD on FPGA (latency 6 to 7 cycles)
- One symbol per call
- 11 cycles / iteration
VLD-2-based entropy decoding
- VLD on FPGA (latency 7 to 8 cycles)
- Two symbols per call
- 17 cycles / iteration
- Double software pipeline
  - at the VLD level
  - at the entropy decoder level
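The per-iteration figures above translate into per-symbol costs by dividing by the number of symbols decoded per iteration. The short calculation below assumes steady state (firing-up and end-of-block overhead are ignored) and one symbol per iteration in the pure-software loop. The function name is an assumption.

```python
def cycles_per_symbol(cycles_per_iteration, symbols_per_iteration):
    """Steady-state cost of one decoded symbol."""
    return cycles_per_iteration / symbols_per_iteration


software_best = cycles_per_symbol(13, 1)    # one look-up
software_worst = cycles_per_symbol(39, 1)   # three look-ups
vld1_based = cycles_per_symbol(11, 1)       # one symbol per VLD-1 call
vld2_based = cycles_per_symbol(17, 2)       # two symbols per VLD-2 call
```

The VLD-2 loop is longer per iteration (17 versus 11 cycles) but cheaper per symbol (8.5 versus 11 cycles), because each iteration delivers two symbols.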
15 Experimental framework
- Testing database: preprocessed MPEG-conformance strings, from which all data not representing DCT coefficients has been removed → only run-level and end-of-block symbols
- MPEG strings are entirely resident in main memory → side effects like asynchronous interrupts, trashing routines, and other operating-system-related tasks do not have to be counted
- Zeroing the reconstructed 8×8 matrices is not counted either → the run-length decoder overwrites the same 8×8 matrices again and again.
- The only relevant metric: the number of instruction cycles needed to perform strictly entropy decoding
- Two experiment classes
  - entropy decoding loop is left on End-of-Block
  - entropy decoding loop is left on End-of-Macro-Block
- Two strategies
  - VLD returns run
  - VLD returns non-zero-coefficient position relative to block/macro-block
16 Experimental results
17 Remarks
- Pure-software solution: 16.0 to 17.0 cycles / symbol
- VLD-2 based: 8.0 to 8.5 cycles / symbol
- Only 2.5 slots of 5 are filled in
- VLD-2 latency: 8 cycles
- Barrel-shifting latency: 1 cycle
- Extract runs, levels, code-lengths: 1 cycle
- If end-of-block, then exit the loop: 3 cycles
- Store the level values: 3 cycles
FPGA-based VLD latency is the bottleneck!
18 Conclusions
- Speed-up of about 2× over the pure-software solution
- Hardware penalty: one EP1K100 FPGA (100,000 gates)
- Bottleneck: the latency of the FPGA-based VLD
- Can we do better?
  - lots of open questions ...
- Future work
  - Testing with HDTV strings
  - Motion compensation
  - YUV to RGB converter
  - Full MPEG decoder