Optimization of H.264 High Profile Decoder for Pentium 4 Processor

Provided by: TAR108
Learn more at: http://www-ee.uta.edu
1
Optimization of H.264 High Profile Decoder for
Pentium 4 Processor
  • Tarun Bhatia
  • University of Texas at Arlington
  • tarun_at_fastvdo.com

2
H.264 Decoder
[Block diagram. Blocks shown: Bitstream Input, Entropy Decoding, Inverse Transform and Dequantization, Intra/Inter Mode Selection, Intra Prediction, Motion Compensation, Deblocking, Picture Buffering, Video Output]
3
Optimization Need
  • H.264/AVC video coding introduces substantially
    more coding tools and coding options than earlier
    standards. Achieving the highest possible coding
    gain therefore costs much more computation.
  • Aggressive optimization is typically required
    for H.264 implementations to meet cost and power
    targets and to provide real-time performance
    for applications.

4
Sequences Used
Girl.264
Karate.264
Golf.264
Shore.264
Plane.264
5
H.264 Profiles
[Diagram of overlapping profiles. Labels recovered from the figure:]
  • Profiles: Baseline Profile, Main Profile, Extended Profile, High Profile
  • I slice, P slice, CAVLC
  • Arbitrary Slice Order (ASO), Flexible Macroblock Ordering (FMO), Redundant Slices
  • B slices, Weighted Prediction
  • CABAC
  • Data Partition, SP Slice, SI Slice
  • Adaptive Block Size Transform, Perceptual Quantization Matrices
6
H.264 High Profile - Features
  • Main Profile features, plus:
  • 8x8 Integer DCT
  • HVS (Human Visual System) quantization matrices
  • 8x8 Intra Prediction modes

7
Optimization Levels
  • Algorithm Level (e.g. DCT implementation)
  • Compiler Level (Microsoft Visual Studio .NET 2003 /
    Intel C compiler v8.0)
  • Implementation Level
  • e.g. elimination of loops and conditions
  • Using SIMD for implementation
  • Multithreading
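
As a small illustration of eliminating conditions at the implementation level, the sketch below clips a value to the 8-bit pixel range with and without branches. The function names are hypothetical and are not taken from the decoder described in these slides; whether the branch-free form is actually faster depends on the compiler and the data.

```c
#include <stdint.h>

/* Straightforward clip to [0, 255] using two conditional branches. */
uint8_t clip_u8_branchy(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Branch-free clip (illustrative): comparisons yield 0 or 1, and negating
 * them gives an all-zeros or all-ones mask that replaces the branches. */
uint8_t clip_u8_branchless(int v)
{
    v &= -(v >= 0);   /* negative input: mask is 0, so v becomes 0        */
    v |= -(v > 255);  /* input above 255: mask is all ones, so v is -1    */
    return (uint8_t)(v & 255);  /* -1 & 255 == 255; in-range v unchanged  */
}
```

Both functions return the same result for every input; the second simply expresses the conditions as arithmetic on masks.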

8
Target Platform: Pentium 4 Processor, Intel SIMD
Architecture

8 XMM Registers, 128 bit
MXCSR, 32 bit
8 MMX Registers, 64 bit
8 GPRs, 32 bit
x87 FP Register File
EFLAGS, 32 bit
FP, MMX, SSE/SSE2/SSE3
FP MOVE
L1 Data Cache (8 KB, 4-way)
9
Intel HT (Hyper Threading) Technology
  • Purpose: Simultaneous Execution of Threads

Architectural State Architectural State
Execution Engine Execution Engine
Local APIC Local APIC
Bus Interface Bus Interface
SYSTEM BUS
10
Optimization Steps
  • Optimization during code development
  • Optimization after code development
  • 1) Searching for hotspots in the code
  • 2) Analysis of each hotspot
  • e.g. a large number of calls, cache misses,
    slow implementation
  • 3) Optimization of hotspots

11
Performance Profiling
  • Intel VTune™ Performance Analyzer

12
Intel VTune Performance Analysis - Results
(FastVDO H.264 HD High Profile Decoder)
Time Consumed (%)
                Girl.264    Golf.264   Karate.264  Plane.264  Shore.264
IDCT 8x8        4.957879    0.956737   3.2735859   1.894126   1.63884
CABAC           11.026452   5.293592   10.335945   10.01807   6.407274
Memcpy/Memset   13.33369    16.86905   11.849307   11.01987   14.59611
IDCT 4x4        17.02527    20.39636   15.568315   12.89757   17.79446
MC              29.53265    38.00137   40.045078   50.16149   41.66314
Others          24.12405    18.48289   18.927766   14.00887   17.90019
13
Distribution of Decoder Time Consumption
14
SIMD
  • Single Instruction Multiple Data instructions
  • Intel Pentium 4
  • MMX (Multimedia Extension) from Pentium MMX
    onwards
  • SSE (Streaming SIMD Extension) from
    Pentium III onwards
  • SSE2 (Streaming SIMD Extension 2) from
    Pentium 4 onwards
  • AMD Athlon 64
  • 3DNow!

15
SIMD Data Types
[Diagram: a 128-bit register packed as 16 x 8-bit, 8 x 16-bit, 4 x 32-bit, 2 x 64-bit, or 1 x 128-bit elements]
Available in XMM registers in SSE Technology
Available in MMX and XMM registers
16
SIMD Instruction Types
  • Packed Arithmetic (e.g. padd, pmul)
  • Packed Logical (e.g. pand, por)
  • Data Movement and Memory Access (mov)
  • General Support (pack, unpack)
  • Packed Shift (>>, <<)
  • Packed Comparison (<, =)

17
Case Study
  • interpolation4x4 (pixel_data *forward_block,
    pixel_data *backward_block)
  • pixel_data result[16]
  • for (int i = 0; i <= 15; i++)
  • result[i] =
    (forward_block[i] + backward_block[i] + 1) / 2

18
MMX Code
  • interpolation (pixel_data *forward_block,
    pixel_data *backward_block)
  • __asm
  • pxor mm7, mm7              // set mm7 to 0
  • mov EDX, 0x01010101        // EDX = 01 01 01 01
  • mov EAX, forward_block     // EAX = forward block starting address
  • movd mm3, EDX              // mm3 = 00 00 00 00 01 01 01 01
  • mov EBX, backward_block    // EBX = backward block starting address
  • punpcklbw mm3, mm7         // mm3 = 00 01 00 01 00 01 00 01 (rounding term)
  • mov ECX, result            // ECX = address of result
  • movd mm0, [EAX]            // mm0 = fb[1..4]
  • movd mm1, [EBX]            // mm1 = bb[1..4]
  • movd mm4, [EAX+4]          // mm4 = fb[5..8]
  • movd mm5, [EBX+4]          // mm5 = bb[5..8]
  • punpcklbw mm0, mm7         // widen fb[1..4] from bytes to 16-bit words
  • punpcklbw mm1, mm7         // widen bb[1..4] from bytes to 16-bit words
  • punpcklbw mm4, mm7         // widen fb[5..8] from bytes to 16-bit words
  • punpcklbw mm5, mm7         // widen bb[5..8] from bytes to 16-bit words
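
The same widen, add, round, shift, and pack idea can be sketched with SSE2 intrinsics instead of inline MMX assembly. This is a minimal sketch, not the decoder's code: it assumes pixel_data is an 8-bit unsigned type and that each 4x4 block is stored as 16 contiguous bytes, and the function names are illustrative. A scalar reference is included, and it is used as the fallback when SSE2 is not available at compile time.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Scalar reference: rounded average of two 16-pixel blocks. */
void interpolation4x4_scalar(const uint8_t *fwd, const uint8_t *bwd,
                             uint8_t *result)
{
    for (int i = 0; i < 16; i++)
        result[i] = (uint8_t)((fwd[i] + bwd[i] + 1) / 2);
}

/* SIMD version: processes all 16 pixels with one 128-bit register pair. */
void interpolation4x4_simd(const uint8_t *fwd, const uint8_t *bwd,
                           uint8_t *result)
{
#if HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    const __m128i one  = _mm_set1_epi16(1);          /* rounding term */
    __m128i f = _mm_loadu_si128((const __m128i *)fwd);
    __m128i b = _mm_loadu_si128((const __m128i *)bwd);

    /* Widen bytes to 16-bit words so that the sums cannot overflow. */
    __m128i lo = _mm_add_epi16(_mm_unpacklo_epi8(f, zero),
                               _mm_unpacklo_epi8(b, zero));
    __m128i hi = _mm_add_epi16(_mm_unpackhi_epi8(f, zero),
                               _mm_unpackhi_epi8(b, zero));

    /* (a + b + 1) >> 1 on each word. */
    lo = _mm_srli_epi16(_mm_add_epi16(lo, one), 1);
    hi = _mm_srli_epi16(_mm_add_epi16(hi, one), 1);

    /* Pack the words back to unsigned bytes and store 16 results. */
    _mm_storeu_si128((__m128i *)result, _mm_packus_epi16(lo, hi));
#else
    interpolation4x4_scalar(fwd, bwd, result);       /* portable fallback */
#endif
}
```

For this particular operation SSE2 also offers `_mm_avg_epu8` (the PAVGB instruction), which computes the rounded byte average directly; the longer form above is kept because it mirrors the unpack steps in the MMX listing.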

19
SIMD Application Results
  • Amdahl's Law: the Overall Speedup (O.S.), in
    percent, obtained by optimizing a portion p of
    the program by a factor s is
  • $\text{O.S.} = \left( \dfrac{1}{1 - p + (p/s)} - 1 \right) \times 100$
  • p = fraction of the code being optimized
  • s = speedup factor for that fraction of code
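
A short sketch of the formula in C follows; the function name is illustrative. As a check, the Girl.264 figures from the IDCT 4x4 results (p = 0.1702527, s = 1.932186) give a value close to the 8.9489 reported there.

```c
/* Overall speedup in percent from Amdahl's Law.
 * p: fraction of total run time spent in the optimized code (0..1)
 * s: speedup factor achieved for that fraction (> 0)            */
double overall_speedup_percent(double p, double s)
{
    return (1.0 / (1.0 - p + p / s) - 1.0) * 100.0;
}
```

Note that p must be entered as a fraction, so a module that takes 17.02527 % of the decoding time is passed as 0.1702527.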

20
Application to IDCT 4x4
                      Girl.264   Golf.264   Karate.264  Plane.264  Shore.264
No SIMD (%)           17.02527   20.39636   15.56831    12.89757   17.79446
SIMD (%)              8.811405   11.61393   9.16206     7.050434   9.89151
Speedup Factor        1.932186   1.756198   1.69921     1.82993    1.79896
Overall Speedup (%)   8.9489     9.628      6.8447      6.21       8.581
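
For orientation, a scalar version of the H.264 4x4 inverse core transform is sketched below; this is the kind of routine the SIMD work targets, not the FastVDO implementation. It assumes the 16 coefficients are already dequantized and stored row by row, and the function names are illustrative.

```c
#include <stdint.h>

/* One 1-D butterfly pass over four values spaced `stride` apart.
 * Right shifts of negative values are assumed to be arithmetic,
 * as they are on mainstream compilers. */
static void idct4_1d(int *d, int stride)
{
    int e0 = d[0] + d[2 * stride];
    int e1 = d[0] - d[2 * stride];
    int e2 = (d[stride] >> 1) - d[3 * stride];
    int e3 = d[stride] + (d[3 * stride] >> 1);

    d[0]          = e0 + e3;
    d[stride]     = e1 + e2;
    d[2 * stride] = e1 - e2;
    d[3 * stride] = e0 - e3;
}

/* 4x4 inverse transform: rows, then columns, then rounding by (x + 32) >> 6.
 * coeff:    16 dequantized coefficients, row-major (assumed layout)
 * residual: 16 output residual samples, row-major                      */
void idct4x4_residual(const int16_t coeff[16], int16_t residual[16])
{
    int tmp[16];

    for (int i = 0; i < 16; i++)
        tmp[i] = coeff[i];
    for (int r = 0; r < 4; r++)
        idct4_1d(tmp + 4 * r, 1);      /* horizontal pass */
    for (int c = 0; c < 4; c++)
        idct4_1d(tmp + c, 4);          /* vertical pass   */
    for (int i = 0; i < 16; i++)
        residual[i] = (int16_t)((tmp[i] + 32) >> 6);
}
```

The transform uses only additions, subtractions, and shifts on 16 independent values per pass, which is why it maps well onto packed 16-bit SIMD arithmetic.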
21
IDCT 4x4: Comparison of Time Consumed as a Share of the
Total Decoding Time
22
Overall Speed up in Decoding Time with SIMD
IDCT4x4
23
Application to Motion Compensation
  • The implementation of Motion Compensation can be
    divided into:
  • Data Manipulation (SIMD not used)
  • Interpolation (SIMD used)
  • Half Pel Interpolation
  • Quarter Pel Interpolation
  • Linear Interpolation for B frames

24
Motion Compensation- Time consumption (without
MMX)
25
SIMD Application to Motion Compensation - Results
                      Girl       Golf       Karate     Plane      Shore
No SIMD (%)           15.96824   13.02634   23.25319   32.54503   19.7399
SIMD (%)              9.51832    7.53874    14.32317   19.7414    11.608
Speedup Factor        1.68       1.73       1.62       1.65       1.7
Overall Speedup (%)   6.89       5.8        9.8        14.68      8.85
26
Motion Compensation Results: Comparison of
Time Consumed
27
Overall Speed up in Decoding Time with SIMD MC
28
Multithreading
  • Definition: Multithreading is the ability of a
    program to multitask within itself. The program
    can split itself into separate threads of
    execution that seem to run concurrently.
  • Waits are used to block a thread until a
    particular event hands over control
  • Release is used to unblock the thread
  • Semaphores: locking mechanism / counters that
    control access to shared resources used by
    multiple processes

29
Producer-Consumer Problem (Diagram)

Producer Thread
Consumer Thread
Semaphores
Wait
Serial Execution Of a Thread
Release
30
Producer-Consumer Problem (Algorithm)
  • Producer thread starts and initializes data
  • Wait for the Consumer thread
  • If the Consumer thread is ready, release control
    to the Consumer thread
  • Producer thread completes one execution cycle in
    the meantime and waits for the Consumer thread
  • When control is passed back to the Producer
    thread, the process is repeated until the end
    condition is met.
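
The wait/release hand-off between the two threads can be sketched with two counting semaphores. The slides do not say which threading API the decoder uses; POSIX threads and semaphores are an assumption made here for portability, with sem_wait standing in for Wait and sem_post for Release. All names are illustrative, and the program may need to be built with -pthread.

```c
#include <pthread.h>
#include <semaphore.h>

#define CYCLES 5

static sem_t item_ready;  /* released by the producer, waited on by the consumer */
static sem_t slot_free;   /* released by the consumer, waited on by the producer */
static int shared_item;   /* single-slot buffer shared by both threads           */
static int consumed_sum;  /* written only by the consumer                        */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 1; i <= CYCLES; i++) {
        sem_wait(&slot_free);    /* Wait: block until the slot is empty  */
        shared_item = i;         /* one execution cycle of the producer  */
        sem_post(&item_ready);   /* Release: hand control to consumer    */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 1; i <= CYCLES; i++) {
        sem_wait(&item_ready);   /* Wait: block until an item exists     */
        consumed_sum += shared_item;
        sem_post(&slot_free);    /* Release: hand control back           */
    }
    return NULL;
}

/* Runs both threads to completion and returns the sum of consumed items. */
int run_producer_consumer(void)
{
    pthread_t p, c;

    consumed_sum = 0;
    sem_init(&item_ready, 0, 0); /* nothing produced yet                 */
    sem_init(&slot_free, 0, 1);  /* the single slot starts out empty     */
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&item_ready);
    sem_destroy(&slot_free);
    return consumed_sum;
}
```

Because each item is written only after the slot is free and read only after it is marked ready, every value from 1 to CYCLES is consumed exactly once, so run_producer_consumer() returns 15.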

31
Multithreading in Video Coding
  • The codec can be multithreaded in two ways:
  • Block Level
  • Independent blocks can be executed as separate
    threads, e.g. slices in H.264, motion estimation,
    deblocking of non-reference frames
  • GOP Level
  • Closed GOP: group of frames that uses no
    reference frames from outside its own GOP
  • Open GOP: group of frames that can use reference
    frames from outside its GOP

32
Proposed Multithreading Architecture -features
  • GOP Level (Closed GOP)
  • 30 frames per GOP
  • IPPPPPPPP
  • Each GOP begins with an I frame and contains P
    frames only (i.e. 1 I frame and 29 P frames in
    each GOP)
  • B frames are not used in the design to maintain
    closed GOP structure
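
With a fixed closed GOP of 30 frames, mapping a frame to its GOP and a GOP to a decoder thread is simple index arithmetic. The sketch below is illustrative only: the round-robin assignment of GOPs to decoders is an assumption, since the slides do not state how GOPs are distributed, and the function names are hypothetical.

```c
#define FRAMES_PER_GOP 30   /* 1 I frame followed by 29 P frames */

/* Index of the GOP that contains the given frame (frames count from 0). */
int gop_of_frame(int frame)
{
    return frame / FRAMES_PER_GOP;
}

/* Non-zero when the frame is the I frame that opens its GOP. */
int is_gop_start(int frame)
{
    return frame % FRAMES_PER_GOP == 0;
}

/* Assumed policy: hand GOPs to decoder threads in round-robin order. */
int decoder_for_gop(int gop, int num_decoders)
{
    return gop % num_decoders;
}
```

Because a closed GOP references no frames outside itself, any such assignment is valid; the decoders never need each other's reference pictures.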

33
Proposed Multithreading Architecture


Get IDR Position
Main Thread
Decoder 0
Decoder 1
Decoder N
34
Multithreaded Decoder - Threads
  • Main Thread
  • Creates all threads and semaphores
  • Get SPS and PPS NALUs from the bitstream
  • Initialize Multiple decoders with SPS and PPS
    NALUs
  • Get IDR Frame Position Thread
  • Search for IDR NALU Position in the bitstream
  • Manage Waits and Releases of Semaphores
  • Decoder Threads
  • Decode H.264 GOPs

SPS: Sequence Parameter Set
PPS: Picture Parameter Set
NALU: Network Abstraction Layer Unit
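
Searching for SPS, PPS, and IDR NAL units can be sketched as a scan for start codes. This assumes the bitstream is in the H.264 Annex B byte-stream format, where each NAL unit is preceded by the byte sequence 00 00 01 and the low five bits of the following header byte give nal_unit_type (5 for an IDR slice, 7 for SPS, 8 for PPS). The function name and return convention are illustrative, not taken from the decoder.

```c
#include <stddef.h>
#include <stdint.h>

enum { NAL_IDR_SLICE = 5, NAL_SPS = 7, NAL_PPS = 8 };

/* Returns the offset of the header byte of the first NAL unit of
 * `wanted_type` whose start code begins at or after `from`,
 * or -1 if no such NAL unit is found. */
long find_nal(const uint8_t *buf, size_t len, size_t from, int wanted_type)
{
    for (size_t i = from; i + 3 < len; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
            int type = buf[i + 3] & 0x1F;   /* nal_unit_type */
            if (type == wanted_type)
                return (long)(i + 3);
            i += 2;                         /* skip past this start code */
        }
    }
    return -1;
}
```

A thread that locates GOP boundaries could call find_nal repeatedly with NAL_IDR_SLICE, passing the previous result plus one as `from`, and hand each offset to a decoder thread.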
35
Multithreading - Results: Speedup in Decoding
Time
[Chart; x-axis: Number of Threads]
36
Multithreading - Results: Threading Overhead (Time
in seconds)
[Chart; x-axis: No. of Threads]
37
Further Research
  • Optimization of High Profile HD (720p) Encoder
    for minimization of Hardware requirement
  • Testing of the H.264 encoder and decoder on
    multicore CPUs
  • Implementation of time-consuming modules of the
    H.264 encoder and decoder on a GPU (Graphics
    Processing Unit)

40
  • Thanks!!