November 1st, 2000

About This Presentation

Title: November 1st, 2000

Description:

Baseline H.263 Video Encoding ... on data dependencies for parallel (out-of-order) execution ... Parallel assembly: SAD, Clip_MB (clips overflowing values) ...

Slides: 20

Provided by: dsp5

Learn more at: https://users.ece.utexas.edu

Transcript and Presenter's Notes

Title: November 1st, 2000

1
Human beings are great programmers, Computers
are poor actors
VLIW DSP vs. SuperScalar Implementation of
a Baseline H.263 Encoder
Serene Banerjee, Hamid R. Sheikh, Lizy K. John, Brian L. Evans, Alan C. Bovik
Department of Electrical and Computer Engineering
The University of Texas at Austin
November 1st, 2000
serene_at_ece.utexas.edu
2
Baseline H.263 Video Encoding
I (intra) frame: the Discrete Cosine Transform (DCT) is used to reduce spatial redundancy within a frame.
P (predicted) frame: motion-compensated prediction (MCP) is used to reduce temporal redundancy, and the DCT is used to reduce spatial redundancy in the prediction error.
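
The P-frame path on this slide can be illustrated with a minimal C sketch of the prediction error for one macroblock. It assumes 8-bit luminance frames stored row-major and an integer-pel motion vector; the function name `mcp_residual` and its parameters are illustrative and are not taken from the UBC codec.

```c
#include <assert.h>
#include <stdint.h>

#define MB_SIZE 16  /* an H.263 luminance macroblock is 16 x 16 pixels */

/* Prediction error for the macroblock whose top-left corner is (mb_x, mb_y)
 * in cur, predicted from ref displaced by the motion vector (mv_x, mv_y).
 * The caller must keep the displaced block inside the reference frame. */
void mcp_residual(const uint8_t *cur, const uint8_t *ref, int stride,
                  int mb_x, int mb_y, int mv_x, int mv_y,
                  int16_t residual[MB_SIZE * MB_SIZE])
{
    int x, y;
    assert(mb_x + mv_x >= 0 && mb_y + mv_y >= 0);
    for (y = 0; y < MB_SIZE; y++) {
        const uint8_t *c = cur + (mb_y + y) * stride + mb_x;
        const uint8_t *r = ref + (mb_y + mv_y + y) * stride + (mb_x + mv_x);
        for (x = 0; x < MB_SIZE; x++)
            residual[y * MB_SIZE + x] = (int16_t)(c[x] - r[x]);
    }
}
```

When the motion vector matches the true displacement, the residual is all zeros, which is why the DCT of the prediction error costs few bits.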
3
Baseline H.263 Encoder
4
H.263 Encoder

Goals (baseline H.263 encoder only):
  Evaluate performance of compiled C code on Very Long Instruction Word (VLIW) Digital Signal Processors (DSPs) and superscalar processors
  Hand-optimize the H.263 video encoder on a VLIW DSP
University of British Columbia (UBC) H.263 Version 2 (H.263+) video codec:
  By Prof. Faouzi Kossentini's group
  http://spmg.ece.ubc.ca
  23,000 lines (720 kbytes) of C code targeted for PCs
  Baseline H.263 and many optional H.263 modes
  Primarily for research purposes

5
TMS320C6701 Processor

Up to eight 32-bit instructions are executed per instruction cycle, in order
Two 32-bit data paths, with sixteen 32-bit registers and sixteen 16-bit data memory banks

[Figure: TMS320C6701 CPU core block diagram. Blocks: Program Fetch, Instruction Dispatch, Instruction Decode, Control Registers, Control Logic, Test/Emulation, Interrupt control, A Register File (units L1, S1, M1, D1), B Register File (units L2, S2, M2, D2)]
6
TMS320C6701 EVM

TMS320C6701 processor:
  11 to 17 pipeline stages, depending on the instruction
External memory:
  256 kB of 133 MHz synchronous burst static random-access memory (SBSRAM)
  8 MB of 100 MHz synchronous dynamic RAM (SDRAM) in two 16-bit RAM banks
  100 MHz clock speed due to SDRAM
Development environment:
  Code Composer: interactive real-time debugging
  Simulator: does not report pipeline stalls

7
SimpleScalar Simulator

A superscalar processor reorders sequential instructions based on data dependencies for parallel (out-of-order) execution
SimpleScalar is a configurable superscalar simulator (http://www.simplescalar.org)
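
To make "reorders instructions based on data dependencies" concrete, here is a minimal C sketch that computes the earliest issue cycle of each instruction when only read-after-write dependencies limit parallelism. It is an idealized model (unlimited issue width, unit latency, register renaming assumed), not a description of how SimpleScalar is implemented; the `Insn` type and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical three-operand instruction: dst = src1 op src2 (register numbers). */
typedef struct { int dst, src1, src2; } Insn;

/* Index of the most recent instruction before i that writes reg, or -1. */
static int last_writer(const Insn *code, int i, int reg)
{
    int j;
    for (j = i - 1; j >= 0; j--)
        if (code[j].dst == reg)
            return j;
    return -1;
}

/* Fills cycle[i] with the earliest cycle in which instruction i can issue,
 * one cycle after the latest producer of its sources. Returns the total
 * number of cycles needed for the whole sequence. */
int schedule_dataflow(const Insn *code, int n, int cycle[])
{
    int i, total = 0;
    assert(n >= 0);
    for (i = 0; i < n; i++) {
        int c = 0;
        int p1 = last_writer(code, i, code[i].src1);
        int p2 = last_writer(code, i, code[i].src2);
        if (p1 >= 0 && cycle[p1] + 1 > c) c = cycle[p1] + 1;
        if (p2 >= 0 && cycle[p2] + 1 > c) c = cycle[p2] + 1;
        cycle[i] = c;
        if (c + 1 > total) total = c + 1;
    }
    return total;
}
```

In this model, independent instructions share a cycle regardless of program order, which is the parallelism an out-of-order core extracts at run time and a VLIW compiler must find at compile time.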

[Figure: SimpleScalar pipeline diagram. Blocks: Fetch, Dispatch, Scheduler, Execute, Writeback, Commit, Memory, Data-TLB, Data cache. TLB: translation lookaside buffer]
Six pipeline stages for out-of-order simulation
8
Comparison of Processors
9
Encoder Profile for VLIW DSP (with level two C optimization only)
1476 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
[Figure: encoder profile chart; labeled segment: SAD]
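
Full-search motion estimation with SAD (sum of absolute differences) can be sketched as follows. This is a plain C illustration under assumptions: 8-bit row-major frames, integer-pel search, and a caller-chosen search range. The names `sad_16x16`, `full_search`, and `MotionVector` are hypothetical and do not come from the UBC codec.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define MB_SIZE 16  /* macroblock is 16 x 16 luminance pixels */

typedef struct { int dx, dy; unsigned sad; } MotionVector;

/* Sum of absolute differences between two 16 x 16 blocks. */
unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sum = 0;
    int x, y;
    for (y = 0; y < MB_SIZE; y++) {
        for (x = 0; x < MB_SIZE; x++)
            sum += (unsigned)abs(cur[x] - ref[x]);
        cur += stride;
        ref += stride;
    }
    return sum;
}

/* Tests every displacement in [-range, range]^2 that stays inside the
 * reference frame and returns the one with the smallest SAD. */
MotionVector full_search(const uint8_t *cur, const uint8_t *ref,
                         int width, int height,
                         int mb_x, int mb_y, int range)
{
    MotionVector best = { 0, 0, UINT_MAX };
    int dx, dy;
    assert(range >= 0);
    for (dy = -range; dy <= range; dy++) {
        for (dx = -range; dx <= range; dx++) {
            int rx = mb_x + dx, ry = mb_y + dy;
            unsigned s;
            if (rx < 0 || ry < 0 || rx + MB_SIZE > width || ry + MB_SIZE > height)
                continue;
            s = sad_16x16(cur + mb_y * width + mb_x, ref + ry * width + rx, width);
            if (s < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = s;
            }
        }
    }
    return best;
}
```

Each candidate costs 256 absolute differences, and the search visits up to (2 * range + 1)^2 candidates per macroblock, which is consistent with SAD being the routine called out in the profile.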
10
Encoder Profile for SuperScalar (1-way with level two C optimization)
196 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
11
H.263 Encoder Comparison (with level 2 C optimization only)

Frame resolution: 128 x 96 (Sub-QCIF)
Full-search motion estimation
Clock speed: 100 MHz
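
The cycle counts on the two profile slides translate into encoding time per frame at the stated 100 MHz clock. A small C helper makes the arithmetic explicit; the function name is illustrative, and applying the same clock to both processors is an assumption based on this slide.

```c
#include <assert.h>

/* Seconds needed to encode one frame, given the cycle count in Mcycles
 * and the clock frequency in MHz (Mcycles / MHz = seconds). */
double seconds_per_frame(double mcycles_per_frame, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return mcycles_per_frame / clock_mhz;
}
```

With the figures quoted above, `seconds_per_frame(1476.0, 100.0)` is about 14.76 s per frame for the VLIW DSP and `seconds_per_frame(196.0, 100.0)` is about 1.96 s per frame for the 1-way superscalar, both far from real time before hand optimization.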

12
VLIW DSP Memory Optimizations

Internal program memory holds:
  Computationally intensive routines
  Commonly used runtime support functions from TI libraries (memcpy, memcmp, and memset)
Internal data memory holds:
  Macroblocks and search area for motion estimation
  Macroblocks for DCT, quantization, coding, and reconstruction
  Local data for computationally intensive routines
  Stack
Speedup: 29 times over level two C optimization
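
One way to keep the search area and current macroblock in internal data memory is to place the buffers in a dedicated section and copy the search window in once per macroblock. The sketch below uses the TI compiler's `DATA_SECTION` pragma, guarded so the code also builds on other compilers. The section name `.idata_me`, the search range, and the function name are assumptions for illustration; the section would have to be mapped to internal memory in the linker command file, and the slides do not say how the authors did this.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MB_SIZE      16
#define SEARCH_RANGE 15                            /* hypothetical range */
#define SEARCH_DIM   (MB_SIZE + 2 * SEARCH_RANGE)  /* 46 x 46 window */

/* Hypothetical section name; only meaningful to the TI C6000 compiler. */
#ifdef _TMS320C6X
#pragma DATA_SECTION(search_area, ".idata_me")
#pragma DATA_SECTION(cur_mb, ".idata_me")
#endif
uint8_t search_area[SEARCH_DIM * SEARCH_DIM];
uint8_t cur_mb[MB_SIZE * MB_SIZE];

/* Copies a SEARCH_DIM x SEARCH_DIM window whose top-left corner is (x0, y0)
 * from the frame (in slower external RAM) into the internal buffer, so the
 * inner SAD loops touch only fast memory. The window must lie in the frame. */
void load_search_area(const uint8_t *frame, int stride, int x0, int y0)
{
    int y;
    assert(x0 >= 0 && y0 >= 0 && x0 + SEARCH_DIM <= stride);
    for (y = 0; y < SEARCH_DIM; y++)
        memcpy(search_area + y * SEARCH_DIM,
               frame + (y0 + y) * stride + x0,
               SEARCH_DIM);
}
```

The design idea is that each external pixel is fetched once and then reused by every candidate displacement that overlaps it.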

13
VLIW DSP Code Optimizations

Compiler intrinsics gave little improvement
Wrote assembly routines:
  Parallel assembly: SAD, Clip_MB (clips overflowing values)
  Linear assembly: Interpolate, FillMBData (packed copy of pixel data into macroblock structures)
Rewriting the C code:
  Unroll loops and pipeline computations
  Use 32-bit packed data I/O to slower external RAM
  Avoid pipeline stalls due to memory bank conflicts
Speedup: 4 times over level two C optimization
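
Two of the ideas in this list can be illustrated in portable C: clipping overflowing values to the 8-bit pixel range, and fetching four pixels per 32-bit load with the per-pixel work unrolled. This is a sketch of the techniques only. The authors' routines were written in C6701 assembly, and the function names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Saturates a reconstructed value to the valid pixel range 0..255. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Clips n values, as a Clip_MB-style routine would for a macroblock. */
void clip_block(const int16_t *in, uint8_t *out, int n)
{
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        out[i] = clip_u8(in[i]);
}

static unsigned absdiff(unsigned a, unsigned b)
{
    return a > b ? a - b : b - a;
}

/* SAD over one 16-pixel row. Each iteration performs one 32-bit load per
 * operand (four packed pixels) and handles the four lanes unrolled.
 * memcpy keeps the load well defined for unaligned addresses; byte order
 * does not matter because matching lanes are compared. */
unsigned sad_row16_packed(const uint8_t *cur, const uint8_t *ref)
{
    unsigned sum = 0;
    int i;
    for (i = 0; i < 16; i += 4) {
        uint32_t c, r;
        memcpy(&c, cur + i, 4);
        memcpy(&r, ref + i, 4);
        sum += absdiff(c & 0xFFu, r & 0xFFu);
        sum += absdiff((c >> 8) & 0xFFu, (r >> 8) & 0xFFu);
        sum += absdiff((c >> 16) & 0xFFu, (r >> 16) & 0xFFu);
        sum += absdiff(c >> 24, r >> 24);
    }
    return sum;
}
```

Packing cuts the number of memory accesses per row from 32 to 8, which matters most when the data sits in slower external RAM.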

14
VLIW DSP Optimizations (assembly routines only)
15
VLIW DSP Encoder Profile (after all C6701 optimizations)
24 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
[Figure: encoder profile chart; labeled segment: SAD]
16
Superscalar Encoder Profile (256-way SimpleScalar processor)
28 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
17
Subroutine Comparisons
18
H.263 Encoder Comparison