1
Human beings are great programmers, Computers
are poor actors
VLIW DSP vs. SuperScalar Implementation of
a Baseline H.263 Encoder
Serene Banerjee, Hamid R. Sheikh, Lizy K. John,
Brian L. Evans, Alan C. Bovik
Department of Electrical and Computer Engineering
The University of Texas at Austin
November 1st, 2000
serene@ece.utexas.edu
2
Baseline H.263 Video Encoding
I (intra) frame: the Discrete Cosine Transform (DCT)
is used to reduce spatial redundancy within a
frame.
P (predicted) frame: motion-compensated
prediction (MCP) is used to reduce temporal
redundancy. The DCT is then used to reduce spatial
redundancy in the prediction error.
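The prediction error for a P frame is the difference between the current block and the motion-compensated block in the reference frame. A minimal sketch follows, assuming 8-bit luminance planes, an integer-pel motion vector, and an 8 x 8 block; the function name `prediction_error` and its parameters are illustrative, not taken from the UBC codec.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 8  /* assumed block size for this sketch (one DCT block) */

/* Prediction error for one BLOCK x BLOCK block.
 * cur, ref  : 8-bit luminance planes sharing the same row stride.
 * (x, y)    : top-left corner of the block in the current frame.
 * (mvx, mvy): integer-pel motion vector into the reference frame.
 * The caller must keep the displaced block inside the reference frame. */
void prediction_error(const uint8_t *cur, const uint8_t *ref, int stride,
                      int x, int y, int mvx, int mvy,
                      int16_t residual[BLOCK][BLOCK])
{
    assert(stride >= BLOCK);
    for (int r = 0; r < BLOCK; r++) {
        for (int c = 0; c < BLOCK; c++) {
            int cur_px = cur[(y + r) * stride + (x + c)];
            int ref_px = ref[(y + mvy + r) * stride + (x + mvx + c)];
            /* Signed difference; range is -255..255, so int16_t suffices. */
            residual[r][c] = (int16_t)(cur_px - ref_px);
        }
    }
}
```

When the motion vector matches the true displacement, the residual is close to zero and the DCT that follows has little energy left to code.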
3
Baseline H.263 Encoder
4
H.263 Encoder
  • Goals: baseline H.263 encoder only
  • Evaluate performance of compiled C code on Very
    Long Instruction Word (VLIW) Digital Signal
    Processors (DSPs) and superscalar processors
  • Hand optimize H.263 video encoder on VLIW DSP
  • University of British Columbia (UBC) H.263
    Version 2 (H.263+) video codec
  • By Prof. Faouzi Kossentini's group:
    http://spmg.ece.ubc.ca
  • 23,000 lines (720 kbytes) of C code targeted for
    PCs
  • Baseline H.263 and many optional H.263 modes
  • Primarily for research purposes

5
TMS320C6701 Processor
  • Up to eight 32-bit instructions are executed in one
    instruction cycle, in order
  • Two 32-bit data paths, with 16 32-bit registers and
    16 16-bit data memory banks

[Figure: TMS320C6701 CPU core block diagram. Labels: Program Fetch, Instruction Dispatch, Instruction Decode, Control Registers, Control Logic, Test/Emulation, Interrupt control; A Register File with units L1, S1, M1, D1; B Register File with units L2, S2, M2, D2]
6
TMS320C6701 EVM
  • TMS320C6701 processor
  • Pipeline of 11 to 17 stages, depending on the
    instruction
  • External memory
  • 256 kB of 133 MHz synchronous burst static
    random-access memory (SBSRAM)
  • 8 MB of 100 MHz synchronous dynamic RAM (SDRAM)
    in two 16-bit RAM banks
  • 100 MHz clock speed due to SDRAM
  • Development environment
  • Code Composer: interactive real-time debugging
  • Simulator: does not report pipeline stalls

7
SimpleScalar Simulator
  • Superscalar processor reorders sequential
    instructions based on data dependencies for
    parallel (out-of-order) execution
  • SimpleScalar is a configurable superscalar
    simulator: http://www.simplescalar.org

[Figure: six pipeline stages for SimpleScalar out-of-order simulation. Labels: Fetch, Dispatch, Scheduler, Execute, Memory, Writeback, Commit; Data cache and Data-TLB (TLB: translation lookaside buffer)]
8
Comparison of Processors
9
Encoder Profile for VLIW DSP (with level two C
optimization only)
1476 Mcycles/frame for 128 x 96 resolution with
full-search motion estimation
[Profile chart not reproduced; labeled segment: SAD]
10
Encoder Profile for SuperScalar (1-way, with
level two C optimization)
196 Mcycles/frame for 128 x 96 resolution with
full-search motion estimation
11
H.263 Encoder Comparison (with level 2 C
optimization only)
  • Frame resolution: 128 x 96 (Sub-QCIF)
  • Full-search motion estimation
  • Clock speed: 100 MHz
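Full-search motion estimation evaluates the sum of absolute differences (SAD) at every candidate displacement in the search window, which is why SAD dominates the encoder profile. The sketch below is a plain C illustration under stated assumptions: 16 x 16 macroblocks, integer-pel search, and candidates outside the frame simply skipped. The names `full_search`, `sad_16x16`, and `MotionVector` are hypothetical and do not come from the UBC codec.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define MB 16  /* macroblock size in luminance pixels */

typedef struct { int dx, dy; unsigned sad; } MotionVector;

/* Sum of absolute differences between two 16 x 16 blocks. */
static unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sum = 0;
    for (int r = 0; r < MB; r++)
        for (int c = 0; c < MB; c++)
            sum += (unsigned)abs(cur[r * stride + c] - ref[r * stride + c]);
    return sum;
}

/* Exhaustive search over displacements in [-range, +range] on both axes.
 * (x, y) is the top-left corner of the macroblock in the current frame.
 * Ties keep the first candidate found in raster order. */
MotionVector full_search(const uint8_t *cur, const uint8_t *ref,
                         int width, int height, int x, int y, int range)
{
    assert(x >= 0 && y >= 0 && x + MB <= width && y + MB <= height);
    MotionVector best = { 0, 0, UINT_MAX };
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int rx = x + dx, ry = y + dy;
            if (rx < 0 || ry < 0 || rx + MB > width || ry + MB > height)
                continue;  /* candidate block leaves the reference frame */
            unsigned sad = sad_16x16(cur + y * width + x,
                                     ref + ry * width + rx, width);
            if (sad < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = sad;
            }
        }
    }
    return best;
}
```

With a range of 7, each macroblock costs up to 225 SAD evaluations of 256 pixel differences each, so any per-pixel saving in the SAD loop is multiplied many times per frame.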

12
VLIW DSP Memory Optimizations
  • Internal program memory holds
  • Computationally intensive routines
  • Commonly used runtime support functions from TI
    libraries (memcpy, memcmp and memset)
  • Internal data memory holds
  • Macroblocks and search area for motion estimation
  • Macroblocks for DCT, quantization, coding,
    reconstruction
  • Local data for computationally intensive routines
  • Stack
  • Speedup: 29 times over level two optimization

13
VLIW DSP Code Optimizations
  • Compiler intrinsics gave little improvement
  • Wrote assembly routines
  • Parallel assembly: SAD, Clip_MB (clips
    overflowing values)
  • Linear assembly: Interpolate, FillMBData (packs a
    copy of pixel data into macroblock structures)
  • Rewriting the C code
  • Unroll loops and pipeline computations
  • Use 32-bit packed data I/O to slower external RAM
  • Avoid pipeline stalls due to memory bank
    conflicts
  • Speedup: 4 times over level two C optimization
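The C-level rewrites listed above (loop unrolling and 32-bit packed data access) can be illustrated on one row of the SAD computation. This is a portable sketch, not the authors' code and not TI assembly: it loads four pixels per 32-bit word and unrolls the per-pixel work by four. The names `sad_row16_packed` and `absdiff_lane` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Absolute difference of the byte in position `lane` (0..3) of two words. */
static unsigned absdiff_lane(uint32_t a, uint32_t b, int lane)
{
    int pa = (int)((a >> (8 * lane)) & 0xFFu);
    int pb = (int)((b >> (8 * lane)) & 0xFFu);
    return (unsigned)(pa > pb ? pa - pb : pb - pa);
}

/* SAD of one 16-pixel row, reading four pixels per 32-bit load. */
unsigned sad_row16_packed(const uint8_t *cur, const uint8_t *ref)
{
    assert(cur != NULL && ref != NULL);
    unsigned sum = 0;
    for (int i = 0; i < 16; i += 4) {
        uint32_t a, b;
        memcpy(&a, cur + i, sizeof a);  /* one packed load = four pixels */
        memcpy(&b, ref + i, sizeof b);
        /* Unrolled by four: no inner loop, independent operations. */
        sum += absdiff_lane(a, b, 0);
        sum += absdiff_lane(a, b, 1);
        sum += absdiff_lane(a, b, 2);
        sum += absdiff_lane(a, b, 3);
    }
    return sum;
}
```

Using `memcpy` for the load keeps the sketch free of alignment and aliasing problems; byte order does not matter here because both words are split into lanes the same way. The four independent lane operations are the kind of work a VLIW compiler or hand scheduler can spread across functional units.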

14
VLIW DSP Optimizations (assembly routines
only)
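One of the assembly routines, Clip_MB, clips overflowing values. In C the same operation might look like the sketch below, which assumes reconstructed samples held as 16-bit integers and saturated to the 8-bit pixel range; `clip_block` and `clip_u8` are illustrative names, not the codec's actual interface.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a signed value to the 8-bit pixel range 0..255. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Clip n reconstructed samples into 8-bit pixels. */
void clip_block(const int16_t *in, uint8_t *out, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        out[i] = clip_u8(in[i]);
}
```

Each sample is independent of the others, which makes this loop a natural candidate for hand-scheduled parallel assembly.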
15
VLIW DSP Encoder Profile (after all C6701
optimizations)
24 Mcycles/frame for 128 x 96 resolution with
full-search motion estimation
[Profile chart not reproduced; labeled segment: SAD]
16
Superscalar Encoder Profile (256-way
SimpleScalar processor)
28 Mcycles/frame for 128 x 96 resolution with
full-search motion estimation
17
Subroutine Comparisons
18
H.263 Encoder Comparison
  • Frame resolution: 128 x 96 (Sub-QCIF)
  • Full-search motion estimation
  • Clock speed: 100 MHz
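The cycle counts quoted on the profile slides can be turned into time per frame by dividing by the clock rate. The worked figures below assume the 100 MHz clock listed above applies to every configuration; they are derived from the quoted Mcycles/frame values, not read from the original comparison table.

```latex
\[
t_{\text{frame}} = \frac{N_{\text{cycles}}}{f_{\text{clk}}},
\qquad f_{\text{clk}} = 100\ \text{MHz} = 10^{8}\ \text{cycles/s}
\]
\[
\begin{aligned}
\text{VLIW DSP, level two C only:} \quad
  & \frac{1476 \times 10^{6}}{10^{8}} = 14.76\ \text{s/frame} \\
\text{1-way superscalar, level two C:} \quad
  & \frac{196 \times 10^{6}}{10^{8}} = 1.96\ \text{s/frame} \\
\text{256-way superscalar:} \quad
  & \frac{28 \times 10^{6}}{10^{8}} = 0.28\ \text{s/frame} \approx 3.6\ \text{frames/s} \\
\text{VLIW DSP, all optimizations:} \quad
  & \frac{24 \times 10^{6}}{10^{8}} = 0.24\ \text{s/frame} \approx 4.2\ \text{frames/s}
\end{aligned}
\]
```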

19
Conclusions
  • With level 2 optimization only
  • One-way superscalar is 7.5x faster than VLIW DSP
  • Four-way to one-way issue speedup is 2.88x
  • 256-way to four-way speedup is 2.4x
  • Variable length coding much faster on superscalar
  • VLIW DSP hand optimization produces 61x speedup
    vs. level two C optimization
  • Placement of often-used data and code on-chip
  • Hand coded SAD, interpolation, and reconstruction
  • 14% faster than 256-way superscalar version
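The ratios in these conclusions can be checked against the Mcycles/frame figures on the profile slides (1476, 196, 28, and 24). The arithmetic below is a consistency check on those quoted numbers only.

```latex
\[
\begin{aligned}
\text{1-way superscalar vs. VLIW DSP (level two C):} \quad
  & \frac{1476}{196} \approx 7.5 \\
\text{1-way to 256-way superscalar:} \quad
  & 2.88 \times 2.4 \approx 6.9,
  \qquad \frac{196}{6.9} \approx 28\ \text{Mcycles/frame} \\
\text{VLIW DSP hand optimization:} \quad
  & \frac{1476}{24} = 61.5 \quad (\text{reported as } 61\times) \\
\text{Optimized VLIW DSP vs. 256-way superscalar:} \quad
  & \frac{28 - 24}{28} \approx 0.14 \quad (\text{about } 14\%\ \text{fewer cycles})
\end{aligned}
\]
```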

http://www.ece.utexas.edu/sheikh/h263