MPEG-compliant Entropy Decoding on FPGA-augmented TriMedia/CPU64


1
MPEG-compliant Entropy Decoding on FPGA-augmented
TriMedia/CPU64

Mihai Sima, Sorin Cotofana, Stamatis Vassiliadis,
Jos van Eijndhoven, Kees Vissers
Delft University of Technology, Philips
Research, TriMedia Technologies, Inc.
CE Colloquium
Computer Engineering Laboratory, TU Delft, Feb.
07, 2002

2
Outline

Goals and assumptions
TriMedia/CPU64 architecture and software tools
Architectural extension for TriMedia/CPU64
MPEG theoretical background
Entropy decoder on standard/extended
TriMedia/CPU64
Experimental framework
Experimental results
Conclusions

3
Goals and assumptions

Custom Computing Machine:
(General-Purpose) Processor augmented with an
FPGA core
Basic idea
(General-Purpose) Processor: medium performance
for a large class of applications
FPGA: flexibility to implement
application-specific computations
To assess the performance of a hybrid
TriMedia/CPU64 + FPGA
TriMedia/CPU64:
5 issue-slot VLIW processor, media-processing
oriented
FPGA: ACEX 1K family
Improvements within the TriMedia media-processing
domain
Benchmark: MPEG-compliant entropy decoding

4
TriMedia/CPU64 architecture

5 issue-slot VLIW processor

64-bit datapath

Multimedia-oriented

Subword parallelism

Predicative operations

Single-slot operation

Super-operation

2's complement, C-style wrap-around arithmetic
clipping, rounding
vector shuffle
multiply-and-sum
sum-of-absolute-differences
look-up

Software tools: compiler, scheduler, assembler,
simulator

treating a 64-bit word as a vector of 8-, 16-, or
32-bit elements
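
To make the subword-parallelism idea concrete, here is a minimal behavioral sketch of a sum-of-absolute-differences over a 64-bit word treated as eight 8-bit elements. The function names and the lane ordering (lane 0 is the least significant byte) are assumptions chosen for illustration; they are not the TriMedia/CPU64 operation names or its defined element order.

```python
MASK64 = (1 << 64) - 1

def unpack_u8(word):
    """Split a 64-bit word into eight unsigned 8-bit lanes (lane 0 = lowest byte)."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]

def sad_u8(a, b):
    """Sum of absolute differences over the eight 8-bit lanes of two 64-bit words."""
    lanes_a = unpack_u8(a & MASK64)
    lanes_b = unpack_u8(b & MASK64)
    return sum(abs(x - y) for x, y in zip(lanes_a, lanes_b))

a = 0x0102030405060708
b = 0x0807060504030201
print(sad_u8(a, b))
```

On a subword-parallel datapath all eight lane differences would be computed by one operation; the Python loop only models the result.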

5
FPGA-augmented TriMedia/CPU64

Reconfigurable Functional Unit
receives instructions from instruction decoder
inputs from register file
outputs to register file
New instructions
SET
ACTIVATE
EXECUTE

User responsibility
appropriate EXECUTE instruction
semantics of the operation
latency of the operation
the issue slots

6
MPEG background

DCT + Q: force as many values as possible to
zero
Zig-zag: 8×8 coefficient matrix → 64-coefficient
vector
RLC: consecutive zeros represented by
their run-length
→ the number of samples is reduced
VLC: shorter codewords to frequently
occurring symbols
→ the average bit rate is reduced
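
The zig-zag and run-length steps above can be sketched as follows. This is a simplified illustration: it treats every coefficient the same way (real MPEG streams handle some coefficients separately), and the function names are hypothetical.

```python
def zigzag_order(n=8):
    """Return (row, col) pairs in zig-zag scan order for an n x n block."""
    # Cells on one anti-diagonal share row + col; the walking direction
    # alternates from one diagonal to the next.
    def key(rc):
        r, c = rc
        s = r + c
        return (s, r if s % 2 else c)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

def zigzag_scan(block):
    """Flatten an 8x8 block into a 64-element vector in zig-zag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]

def run_level_encode(coeffs):
    """Replace each run of zeros plus the following non-zero value by (run, level)."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs  # trailing zeros are implied by the End-of-Block symbol

block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 12, 5, -3
print(run_level_encode(zigzag_scan(block)))
```

A quantized block with only three non-zero coefficients shrinks from 64 samples to three run-level pairs plus End-of-Block, which is the sample reduction the slide refers to.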

VLD is a sequential algorithm
Decoding: the inverse operation of coding
7
Variable-length decoding
Entropy decoding
Run-length decoding

Code-length of the symbol is variable
Both input and output rate of a VLD cannot be
kept constant
Three different variable-length decoders
constant-input-rate VLD: decodes a fixed number
of bits and produces a variable number of symbols
per unit time
constant-output-rate VLD: decodes one symbol per
cycle regardless of its length
variable-input-output-rate VLD: mixture of the
first two

First, the number of zeros specified by run value
is issued
Then, the level is passed through
Optimization for programmable-processor platform:
an empty vector is filled in with level values at
positions defined by run values
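
A minimal sketch of this optimization, assuming the vector has already been zeroed and that the run-level pairs arrive as Python tuples (the function name is hypothetical):

```python
def run_level_decode(pairs, size=64):
    """Fill a zeroed vector with levels at the positions implied by the runs."""
    out = [0] * size          # the "empty" vector: zeros never have to be issued
    pos = 0
    for run, level in pairs:
        pos += run            # skip over the zeros instead of writing them
        out[pos] = level
        pos += 1
    return out

vector = run_level_decode([(0, 12), (0, 5), (1, -3)])
print(vector[:6])
```

Only one store per non-zero coefficient is performed, instead of one store per coefficient as in the literal "issue the zeros, then pass the level" formulation.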

2^17 = 128 Kwords = 512 KB for direct mapping of
all possible codewords
2^34 = 16 Gwords = 64 GB for direct mapping of all
possible two-codeword combinations
Such a large memory is impractical for the time
being
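
The table sizes can be checked with a few lines of arithmetic. The 4-byte word size is an assumption, inferred from the ratio between the word counts and byte counts quoted on the slide.

```python
BYTES_PER_WORD = 4  # assumption: implied by 128 Kwords corresponding to 512 KB

def direct_table_size(index_bits):
    """Return (words, bytes) of a table addressed directly by index_bits bits."""
    words = 1 << index_bits
    return words, words * BYTES_PER_WORD

for bits in (17, 34):
    words, size = direct_table_size(bits)
    print(f"{bits}-bit index: {words} words, {size} bytes")
```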

Both are sequential algorithms
8
Entropy decoding in software

Reference implementation Philips
Improvement 19

VL decoding strategy: repeated table look-up
Each look-up decodes a variable chunk of bits
Hit → run-level pair → RLD by filling in the 8×8
matrix
Miss → offset and chunk size for the next look-up

Up to three look-ups are needed to decode a
symbol
A single look-up takes a minimum of 13 cycles
13 to 39 cycles / symbol
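
The hit/miss mechanism can be illustrated with a toy prefix code. The code set, the table layout, and all names below are invented for this sketch; they are not the MPEG tables or the Philips reference implementation.

```python
# Toy code: "1"->(0,1)  "01"->(1,1)  "001"->(0,2)  "0001"->(2,1)  "0000"->EOB
# Each table is (chunk_size, entries); an entry is (kind, payload, bits_used).
# On a hit the payload is the symbol; on a miss it names the next table.
TABLES = {
    "root": (2, {0b00: ("miss", "t1", 2), 0b01: ("hit", (1, 1), 2),
                 0b10: ("hit", (0, 1), 1), 0b11: ("hit", (0, 1), 1)}),
    "t1":   (2, {0b00: ("hit", "EOB", 2), 0b01: ("hit", (2, 1), 2),
                 0b10: ("hit", (0, 2), 1), 0b11: ("hit", (0, 2), 1)}),
}

def peek(bits, pos, n):
    """Read n bits starting at pos as an integer, zero-padding past the end."""
    return int(bits[pos:pos + n].ljust(n, "0"), 2)

def decode_symbol(bits, pos):
    """Repeat look-ups until a hit; return (symbol, new_pos, lookups)."""
    table, lookups = "root", 0
    while True:
        size, entries = TABLES[table]
        kind, payload, used = entries[peek(bits, pos, size)]
        pos += used
        lookups += 1
        if kind == "hit":
            return payload, pos, lookups
        table = payload  # miss: continue in the table the entry points to

def decode_block(bits):
    """Decode run-level pairs until End-of-Block."""
    pos, symbols = 0, []
    while True:
        sym, pos, _ = decode_symbol(bits, pos)
        if sym == "EOB":
            return symbols
        symbols.append(sym)

print(decode_block("1" "01" "001" "0001" "0000"))
```

Short codewords are resolved by the first look-up, longer ones need a second one; with the real tables the slide reports up to three look-ups per symbol.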

Idea !
9
VLD on TriMedia/CPU64 RFU

What functionality to embed into the FPGA?
VLD must be balanced against the goal of making
the whole entropy decoder fast.

No tri-state logic in FPGA
Barrel-shifting can be implemented by cascaded
multiplexers selecting fixed-size shifting by
1, 2, 4, ...
Barrel-shifting is expensive in FPGA
Latency
56 → 28 bits (2 → 1 codewords): 10.2 ns
84 → 28 bits (3 → 1 codewords): 11.9 ns
112 → 56 bits (4 → 2 codewords): 15.0 ns
3 TriMedia cycles
Barrel-shifting is cheap in standard TriMedia
1 TriMedia cycle
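
A behavioral sketch of a barrel shifter built from cascaded multiplexer stages, each stage either passing its input through or shifting it by a fixed power of two. The function name and the 64-bit default width are assumptions; the sketch models the logic structure only, not FPGA timing.

```python
def barrel_shift_left(value, amount, width=64):
    """Left-shift value by amount (0 <= amount < width) using log2(width) stages."""
    mask = (1 << width) - 1
    value &= mask
    stage_shift = 1
    while stage_shift < width:
        # One 2:1 multiplexer row: bit k of `amount` selects between the
        # unshifted input and the input shifted by a fixed 2**k positions.
        if amount & stage_shift:
            value = (value << stage_shift) & mask
        stage_shift <<= 1
    return value

print(hex(barrel_shift_left(0x0123456789ABCDEF, 12)))
```

For a 64-bit input the cascade has six stages (1, 2, 4, 8, 16, 32), and every stage is as wide as the data, which is why the structure costs a lot of FPGA routing and logic.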

RFU calls must have fixed latency
RFU latency
1 cycle to read the arguments from register file
delay on FPGA
1 cycle to write back the results to register
file
Maximum 4 input 64-bit arguments
Maximum 2 output 64-bit results

10
VLD-1 on FPGA

Design parameters
Run/Level pair or End-of-Block per execution
Fixed latency

How to fulfill that?
First idea: 17-input look-up table

TOO LARGE !
Latency: 6 to 7 TriMedia cycles

Second idea: partition the VL codes into groups
in order to allow for smaller look-up tables (LUT)

11
VLD-2 on FPGA

Two symbols per execution: how to do that?
Truly two codewords at a time: 64 GB look-up
table

Latency: 7 to 8 TriMedia cycles

Early work (I)
run, level, code-length for 1st codeword
barrel-shift
run, level, code-length for 2nd codeword

Early work (II)
run, level, code-length for 1st codeword
run, level, code-length for all possible 2nd
codewords
only a selection of the proper
codeword is carried out

VLD-2: new idea !
run, level, code-length for 1st codeword
code-length for all possible 2nd codewords
only a selection of the proper code-length is
carried out
the computation of the run and level for the 2nd
codeword is postponed to the next VLD call
with the exception of a firing-up call, truly 2
codewords per call are achieved for all subsequent
calls
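
The postponement idea can be modeled in software as follows. This is a behavioral sketch on an invented toy code, not the FPGA design: `vld1`, `vld2_call`, and the string-based bit stream are all assumptions made for illustration.

```python
# Toy prefix code (hypothetical, not the MPEG tables).
CODES = {"1": (0, 1), "01": (1, 1), "001": (0, 2), "0001": (2, 1), "0000": "EOB"}

def vld1(bits, pos):
    """Decode one codeword at pos; return (symbol, code_length)."""
    for length in range(1, 5):
        sym = CODES.get(bits[pos:pos + length])
        if sym is not None:
            return sym, length
    raise ValueError("no valid codeword at position %d" % pos)

def vld2_call(bits, pos_prev, pos_cur):
    """Model of one VLD-2 call.

    Returns the symbol postponed from the previous call, the fully decoded
    current symbol, and the code-lengths of the current and next codewords.
    The next codeword's run and level are left for the following call.
    """
    sym_p = None
    if pos_prev is not None:                # None marks the firing-up call
        sym_p, _ = vld1(bits, pos_prev)
        if sym_p == "EOB":
            return sym_p, None, 0, 0
    sym_c, len_c = vld1(bits, pos_cur)
    if sym_c == "EOB":
        return sym_p, sym_c, len_c, 0
    _, len_n = vld1(bits, pos_cur + len_c)  # only the length is used here
    return sym_p, sym_c, len_c, len_n

def decode_block_vld2(bits):
    """Drive vld2_call until End-of-Block; return (symbols, number_of_calls)."""
    symbols, calls = [], 0
    pos_prev, pos_cur = None, 0
    while True:
        sym_p, sym_c, len_c, len_n = vld2_call(bits, pos_prev, pos_cur)
        calls += 1
        for sym in (sym_p, sym_c):
            if sym == "EOB":
                return symbols, calls
            if sym is not None:
                symbols.append(sym)
        pos_prev = pos_cur + len_c          # the postponed codeword
        pos_cur = pos_prev + len_n          # first codeword of the next call

print(decode_block_vld2("1" "01" "001" "0001" "0000"))
```

In this run the firing-up call delivers one symbol and every later call delivers two, while the input pointer advances by two code-lengths per call.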

12
VLD-x on FPGA (x ≥ 3)

VLD-2 principle is scalable and can be extended
to VLD-x (x ≥ 3)
VLD-3: two next / previous codewords are
considered
Unfortunately, VLD-x (x ≥ 3) seems not to be
feasible
the computation of the code-lengths for the two
next codewords is on the critical path
e.g., 12 TriMedia cycles are needed only to
decode the code-lengths for the current and two
next codewords
while the selection of the code-length of the
next codeword in VLD-2 can be completed in about
the same time as run-level decoding for the
current and previous codewords
VLD-2 latency ≈ VLD-1 latency
Constraints related to TriMedia/CPU64
super-operation format
too many values have to be returned by a VLD call
packing difficulties

13
Entropy decoding on extended TriMedia

VLD-1-based entropy decoder
1. Initializations
2. VLD-1 call
3. Field extraction
run, level, code-length, exit flag
4. Updating accumulated code-length
5. Exit if exit condition
6. Run-length decoding
7. Aligning the input string
8. Go to 2.
Stage 6 can be folded into the loop
(software pipelining)
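
The VLD-1-based loop can be sketched step by step as below. The packed result layout (exit flag in bit 0, code-length in bits 1 to 5, run in bits 6 to 11, level above that), the toy code, and the restriction to non-negative levels are all assumptions for this illustration; the slides do not specify the actual field format.

```python
# Toy prefix code (hypothetical, not the MPEG tables).
CODES = {"1": (0, 1), "01": (1, 1), "001": (0, 2), "0001": (2, 1), "0000": "EOB"}

def vld1_call(window):
    """Stand-in for the RFU operation: decode the leading codeword of `window`
    and return run, level, code-length and exit flag packed into one integer."""
    for length in range(1, 5):
        sym = CODES.get(window[:length])
        if sym is not None:
            break
    else:
        raise ValueError("no valid codeword")
    if sym == "EOB":
        return (length << 1) | 1
    run, level = sym
    return (level << 12) | (run << 6) | (length << 1)

def entropy_decode_block(bits):
    block = [0] * 64                       # 1. initializations
    acc_len, pos, window = 0, 0, bits
    while True:
        word = vld1_call(window)           # 2. VLD-1 call
        exit_flag = word & 1               # 3. field extraction
        code_length = (word >> 1) & 0x1F
        run = (word >> 6) & 0x3F
        level = word >> 12
        acc_len += code_length             # 4. update accumulated code-length
        if exit_flag:                      # 5. exit if exit condition
            return block, acc_len
        pos += run                         # 6. run-length decoding
        block[pos] = level
        pos += 1
        window = bits[acc_len:]            # 7. align the input string
                                           # 8. go to 2

block, consumed = entropy_decode_block("1" "01" "001" "0001" "0000")
print(block[:8], consumed)
```

In a software-pipelined version, stage 6 of one iteration would overlap with the VLD-1 call of the next; the sequential loop here only shows the data flow.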

VLD-2-based entropy decoder
1. Initializations
2. VLD-2 call
3. Field extraction: run_p, level_p, run_c,
level_c, code-length_c, code-length_n, exit flag
4. Updating accumulated code-length
5. Aligning the input string for previous
codeword
6. Updating accumulated code-length
7. Exit if exit condition
8. Run-length decoding for previous codeword
9. Run-length decoding for current codeword
10. Aligning the input string for current, next
codeword
11. Go to 2.
Stage 9 or both Stages 8 and 9
can be folded into the loop (software pipelining)

WATCH OUT ! Due to the higher complexity, the
overhead associated with loop firing-up may
become significant !
14
Entropy decoding computation overview
On standard TriMedia/CPU64

Hit / Miss mechanism
Up to 3 look-ups are needed to decode a symbol
A single look-up takes a minimum of 13 cycles
13 to 39 cycles / iteration

On FPGA-augmented TriMedia/CPU64
VLD-1-based entropy decoding
VLD-2-based entropy decoding

VLD on FPGA (latency 6 to 7 cycles)
One symbol per call
11 cycles / iteration

VLD on FPGA (latency 7 to 8 cycles)
Two symbols per call
17 cycles / iteration

Double software pipeline
at the VLD level
at the entropy decoder level
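
The per-iteration figures above can be turned into per-symbol costs with simple arithmetic. This is a steady-state estimate only: it ignores loop firing-up overhead and assumes every VLD-2 iteration yields two useful symbols, so it should not be read as the measured result of the experiments.

```python
def cycles_per_symbol(cycles_per_iteration, symbols_per_iteration):
    """Steady-state cost of one decoded symbol."""
    return cycles_per_iteration / symbols_per_iteration

software = (13.0, 39.0)               # one to three look-ups per symbol
vld1 = cycles_per_symbol(11, 1)       # VLD-1: one symbol per iteration
vld2 = cycles_per_symbol(17, 2)       # VLD-2: two symbols per iteration

for name, cost in (("VLD-1", vld1), ("VLD-2", vld2)):
    low, high = software[0] / cost, software[1] / cost
    print(f"{name}: {cost} cycles/symbol, {low:.2f}x to {high:.2f}x vs. software")
```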

15
Experimental framework

Testing database: preprocessed MPEG-conformance
strings, from which all data not representing DCT
coefficients has been removed → only run-level
and end-of-block symbols
MPEG strings are entirely resident in main
memory → side effects like asynchronous
interrupts, thrashing routines, and other
operating-system-related tasks do not have to be
counted
Zeroing the reconstructed 8×8 matrices is not
counted either → the run-length decoder overwrites
the same 8×8 matrices again and again.
The only relevant metric: the number of
instruction cycles needed to perform strictly
entropy decoding
Two experiment classes
entropy decoding loop is left on End-of-Block
entropy decoding loop is left on
End-of-Macro-Block
Two strategies
VLD returns run
VLD returns non-zero-coefficient-position
relative to block/macro-block
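
The difference between the two strategies can be sketched as follows: with runs, the software has to accumulate the write position itself; with positions, that accumulation has already been done by the VLD. The function names are hypothetical and the sketch covers the per-block case only.

```python
def runs_to_positions(pairs):
    """Convert (run, level) pairs into (position, level) pairs within a block."""
    out, pos = [], 0
    for run, level in pairs:
        pos += run
        out.append((pos, level))
        pos += 1
    return out

def fill_by_position(position_pairs, size=64):
    """Run-length decoding when the VLD already returns coefficient positions."""
    block = [0] * size
    for pos, level in position_pairs:
        block[pos] = level        # a plain indexed store, no running sum needed
    return block

pairs = [(0, 12), (0, 5), (1, -3)]
print(runs_to_positions(pairs))
print(fill_by_position(runs_to_positions(pairs))[:4])
```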