Optimization of H.264 High Profile Decoder for Pentium 4 Processor

Slides: 41

Provided by: TAR108

Learn more at: http://www-ee.uta.edu

Transcript and Presenter's Notes

1
Optimization of H.264 High Profile Decoder for Pentium 4 Processor

Tarun Bhatia
University of Texas at Arlington
tarun_at_fastvdo.com

2
H.264 Decoder
(Block diagram. Blocks shown: Bitstream Input, Entropy Decoding, Inverse Transform and Dequantization, Intra/Inter Mode Selection, Intra Prediction, Motion Compensation, Deblocking, Picture Buffering, Video Output.)
3
Optimization Need

H.264/AVC video coding introduces substantially more coding tools and coding options than earlier standards, so achieving the highest possible coding gain costs much more computation.
Aggressive optimization is typically required for H.264 implementations to meet cost and power targets and to deliver real-time performance.

4
Sequences Used
Girl.264
Karate.264
Golf.264
Shore.264
Plane.264
5
H.264 Profiles
(Diagram of overlapping profile tool sets. Tools by profile:)
All profiles: I slice, P slice, CAVLC
Baseline and Extended: Arbitrary Slice Order (ASO), Flexible Macroblock Ordering (FMO), Redundant Slices
Main, Extended and High: B slices, Weighted Prediction
Main and High: CABAC
Extended only: Data Partition, SP Slice, SI Slice
High only: Adaptive Block Size Transform, Perceptual Quantization Matrices
6
H.264 High Profile - features

Main Profile features, plus:
8x8 integer DCT
HVS (human visual system) based quantization matrices
8x8 intra prediction modes

7
Optimization Levels

Algorithm Level
  e.g. DCT implementation
Compiler Level
  (Microsoft Visual Studio .NET 2003 / Intel C++ Compiler v8.0)
Implementation Level
  e.g. elimination of loops and conditions
  Using SIMD for implementation
  Multithreading

8
Target Platform: Pentium 4 Processor, Intel SIMD Architecture

(Register and execution-unit diagram. Labels shown:)
8 XMM registers, 128 bits each
MXCSR, 32 bits
8 MMX registers, 64 bits each
8 GPRs, 32 bits each
x87 FP register file
EFLAGS, 32 bits
Execution units: FP, MMX, SSE/SSE2/SSE3, FP move
L1 data cache (8 KB, 4-way)
9
Intel HT (Hyper-Threading) Technology

Purpose: simultaneous execution of threads

(Diagram. Labels shown: Architectural State (x2), Execution Engine (x2), Local APIC (x2), Bus Interface (x2), System Bus.)
10
Optimization Steps

Optimization during code development
Optimization after code development
  1) Search for hotspots in the code
  2) Analyze each hotspot, e.g. a high call count, cache misses, or a slow implementation
  3) Optimize the hotspots

11
Performance Profiling

Intel VTune(TM) Performance Analyzer

12
Intel VTune Performance Analysis - Results
(FastVDO H.264 HD High Profile Decoder)
Time consumed (% of total decoding time)

Module           Girl.264    Golf.264    Karate.264   Plane.264   Shore.264
IDCT 8x8         4.957879    0.956737    3.2735859    1.894126    1.63884
CABAC            11.026452   5.293592    10.335945    10.01807    6.407274
Memcpy/Memset    13.33369    16.86905    11.849307    11.01987    14.59611
IDCT 4x4         17.02527    20.39636    15.568315    12.89757    17.79446
MC               29.53265    38.00137    40.045078    50.16149    41.66314
Others           24.12405    18.48289    18.927766    14.00887    17.90019
13
Distribution of Decoder Time Consumption
14
SIMD

Single Instruction, Multiple Data (SIMD) instructions
Intel Pentium 4
  MMX (Multimedia Extension), from Pentium MMX onwards
  SSE (Streaming SIMD Extensions), from Pentium III onwards
  SSE2 (Streaming SIMD Extensions 2), from Pentium 4 onwards
AMD Athlon 64
  3DNow!

15
SIMD Data Types
(Figure: a 128-bit register viewed as 16 x 8-bit, 8 x 16-bit, 4 x 32-bit, 2 x 64-bit, or 1 x 128-bit packed elements.
Legend: "Available in XMM registers in SSE Technology"; "Available in MMX and XMM registers".)
16
SIMD Instructions Types

Packed Arithmetic (e.g. padd, pmul)
Packed Logical (e.g. pand, por)
Data Movement and Memory Access (mov)
General Support (pack, unpack)
Packed Shift (>>, <<)
Packed Comparison (<, =)

17
Case Study

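// Rounded average of two 4x4 blocks (16 pixels each).
void interpolation4x4(const pixel_data *forward_block,
                      const pixel_data *backward_block,
                      pixel_data *result)
{
    for (int i = 0; i <= 15; i++)
        result[i] = (forward_block[i] + backward_block[i] + 1) / 2;
}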

18
MMX Code

interpolation (pixel_data *forward_block, pixel_data *backward_block)
{
  __asm
  {
    pxor      mm7, mm7              // set mm7 to 0
    mov       EDX, 0x01010101       // EDX = 01 01 01 01
    mov       EAX, forward_block    // store forward block starting address
    movd      mm3, EDX              // mm3 = 00 00 00 00 01 01 01 01
    mov       EBX, backward_block   // store backward block starting address
    punpcklbw mm3, mm7              // mm3 = 00 01 00 01 00 01 00 01 (rounding constant 1 in each word)
    mov       ECX, result           // store the address of result
    movd      mm0, [EAX]            // mm0 = fb[1..4]
    movd      mm1, [EBX]            // mm1 = bb[1..4]
    movd      mm4, [EAX+4]          // mm4 = fb[5..8]
    movd      mm5, [EBX+4]          // mm5 = bb[5..8]
    punpcklbw mm0, mm7              // zero-extend fb[1..4] to 16-bit words
    punpcklbw mm1, mm7              // zero-extend bb[1..4] to 16-bit words
    punpcklbw mm4, mm7              // zero-extend fb[5..8] to 16-bit words
    punpcklbw mm5, mm7              // zero-extend bb[5..8] to 16-bit words
    // ... (remainder of the routine is not shown on the slide)
  }
}
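For comparison, the same rounded average can be sketched with compiler intrinsics instead of inline assembly. This is only an illustration, not the decoder's code: the function names are hypothetical, the 16 pixels of each 4x4 block are assumed to be contiguous 8-bit values, and SSE2's packed byte average (`_mm_avg_epu8`, which computes (a + b + 1) >> 1 per byte) is swapped in for the unpack-and-add sequence shown on the slide. A scalar fallback is used when SSE2 is not available.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Scalar reference: rounded average of two 4x4 blocks (16 pixels). */
static void interpolate4x4_scalar(const uint8_t *fwd, const uint8_t *bwd,
                                  uint8_t *out)
{
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)((fwd[i] + bwd[i] + 1) / 2);
}

/* SIMD sketch: one packed byte average covers all 16 pixels at once.
 * Assumes the 16 pixels of each block are stored contiguously. */
static void interpolate4x4_simd(const uint8_t *fwd, const uint8_t *bwd,
                                uint8_t *out)
{
#if HAVE_SSE2
    __m128i a = _mm_loadu_si128((const __m128i *)fwd);  /* unaligned load */
    __m128i b = _mm_loadu_si128((const __m128i *)bwd);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(a, b));
#else
    interpolate4x4_scalar(fwd, bwd, out);  /* portable fallback */
#endif
}
```

Both functions should produce identical output for any input, which makes the scalar version a convenient check when a SIMD path is being tuned.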

19
SIMD Application Results

Amdahl's Law: the overall speedup (O.S.), in percent, obtained by optimizing a portion p of the program by a factor s is

$$\text{O.S.} = \left( \frac{1}{1 - p + (p/s)} - 1 \right) \times 100$$

p: fraction of the code being optimized
s: speedup factor for that fraction of the code
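As a worked check of the formula, O.S. = (1 / (1 - p + p/s) - 1) x 100, the sketch below evaluates it for one data point from the results on these slides: the IDCT 4x4 module on Girl.264, with p = 0.1702527 (17.02527% of decoding time) and s = 1.932186. The function name is hypothetical.

```c
/* Overall speedup in percent when a fraction p of the run time
 * is accelerated by a factor s (Amdahl's Law). */
static double overall_speedup_percent(double p, double s)
{
    return (1.0 / ((1.0 - p) + p / s) - 1.0) * 100.0;
}
```

For p = 0.1702527 and s = 1.932186 this gives about 8.95%, in line with the 8.9489 reported for Girl.264 in the IDCT 4x4 results.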

20
Application to IDCT 4x4
                      Girl.264   Golf.264   Karate.264  Plane.264  Shore.264
No SIMD (%)           17.02527   20.39636   15.56831    12.89757   17.79446
SIMD (%)              8.811405   11.61393   9.16206     7.050434   9.89151
Speedup factor        1.932186   1.756198   1.69921     1.82993    1.79896
Overall speedup (%)   8.9489     9.628      6.8447      6.21       8.581
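The slides do not show the transform itself. As a point of reference for what is being accelerated, here is a minimal scalar sketch of the 4x4 inverse core transform as specified in H.264 (rows first, then columns, then rounding with (x + 32) >> 6). The function names and the raster-order `int` arrays are assumptions made for illustration, and the code relies on arithmetic right shift of negative values, which is what common compilers provide.

```c
/* One 1-D pass of the H.264 4x4 inverse core transform. */
static void idct4_1d(const int in[4], int out[4])
{
    int e0 = in[0] + in[2];
    int e1 = in[0] - in[2];
    int e2 = (in[1] >> 1) - in[3];
    int e3 = in[1] + (in[3] >> 1);
    out[0] = e0 + e3;
    out[1] = e1 + e2;
    out[2] = e1 - e2;
    out[3] = e0 - e3;
}

/* coeff: 16 dequantized coefficients in raster order.
 * residual: 16 residual samples in raster order. */
static void idct4x4(const int coeff[16], int residual[16])
{
    int tmp[16], col[4], res[4];

    for (int r = 0; r < 4; r++)               /* horizontal pass */
        idct4_1d(&coeff[4 * r], &tmp[4 * r]);

    for (int c = 0; c < 4; c++) {             /* vertical pass */
        for (int r = 0; r < 4; r++)
            col[r] = tmp[4 * r + c];
        idct4_1d(col, res);
        for (int r = 0; r < 4; r++)
            residual[4 * r + c] = (res[r] + 32) >> 6;  /* final rounding */
    }
}
```

The transform uses only additions, subtractions and shifts on small integers, and the four rows (or columns) are processed identically, which is why it maps well onto packed 16-bit SIMD arithmetic.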
21
IDCT 4x4: Comparison of Time Consumed (% of Total Decoding Time)
22
Overall Speedup in Decoding Time with SIMD IDCT 4x4
23
Application to Motion Compensation

The implementation of Motion Compensation can be divided into:
Data Manipulation (SIMD not used)
Interpolation (SIMD used)
  Half-pel interpolation
  Quarter-pel interpolation
  Linear interpolation for B frames

24
Motion Compensation: Time Consumption (without MMX)
25
SIMD Application to Motion Compensation - Results
                      Girl       Golf       Karate     Plane      Shore
No SIMD (%)           15.96824   13.02634   23.25319   32.54503   19.7399
SIMD (%)              9.51832    7.53874    14.32317   19.7414    11.608
Speedup factor        1.68       1.73       1.62       1.65       1.7
Overall speedup (%)   6.89       5.8        9.8        14.68      8.85
26
Motion Compensation Results: Comparison of Time Consumed
27
Overall Speedup in Decoding Time with SIMD MC
28
Multithreading

Definition: multithreading is the ability of a program to multitask within itself. The program can split itself into separate threads of execution that appear to run concurrently.
Wait: blocks the thread until a particular event hands over control.
Release: unblocks the thread.
Semaphores: locking mechanisms / counters that control access to shared resources used by multiple threads or processes.

29
Producer-Consumer Problem (Diagram)

Producer Thread
Consumer Thread
Semaphores
Wait
Serial Execution Of a Thread
Release
30
Producer-Consumer Problem (Algorithm)

The Producer thread starts and initializes data
The Producer waits for the Consumer thread
If the Consumer thread is ready, control is released to the Consumer thread
The Producer thread completes one execution cycle in the meantime and then waits for the Consumer thread again
When control is passed back to the Producer thread, the process is repeated until the end condition is met
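The wait/release handshake above can be sketched with two counting semaphores. This is an illustrative sketch only: POSIX threads and POSIX semaphores are swapped in here as an assumption (the slides do not say which threading API the decoder uses), and all names are hypothetical. On many systems it needs to be built with `-pthread`.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

#define ITEMS 8

static sem_t slot_empty;   /* released by the consumer when the slot is free   */
static sem_t slot_full;    /* released by the producer when the slot has data  */
static int slot;           /* single shared item passed between the threads    */
static int consumed_sum;

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 1; i <= ITEMS; i++) {
        sem_wait(&slot_empty);   /* Wait: block until the consumer is ready  */
        slot = i;                /* one execution cycle of the producer      */
        sem_post(&slot_full);    /* Release: hand control to the consumer    */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITEMS; i++) {
        sem_wait(&slot_full);    /* Wait: block until data is available      */
        consumed_sum += slot;
        sem_post(&slot_empty);   /* Release: hand control back               */
    }
    return NULL;
}

/* Runs one producer and one consumer to completion; returns the sum consumed. */
static int run_producer_consumer(void)
{
    pthread_t p, c;
    consumed_sum = 0;
    sem_init(&slot_empty, 0, 1);  /* slot starts out free  */
    sem_init(&slot_full, 0, 0);   /* no data available yet */
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&slot_empty);
    sem_destroy(&slot_full);
    return consumed_sum;
}
```

Because each semaphore is released only by the opposite thread, the two threads strictly alternate on the shared slot, and the end condition is simply the fixed item count.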

31
Multithreading in Video Coding

The codec can be multithreaded in two ways:
Block Level
  Independent blocks can be executed as separate threads, e.g. slices in H.264, motion estimation, deblocking of non-reference frames
GOP Level
  Closed GOP: a group of frames that uses no reference frames from outside its own GOP
  Open GOP: a group of frames that can use reference frames from outside its GOP

32
Proposed Multithreading Architecture - features

GOP level (closed GOP)
30 frames per GOP
IPPPPPPPP
Each GOP begins with an I frame and otherwise contains only P frames (i.e. 1 I frame and 29 P frames in each GOP)
B frames are not used in the design, to maintain the closed GOP structure
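The GOP structure above can be written down as a small sketch. The frame-type and GOP-index functions follow directly from the 30-frame I + 29 P layout; the round-robin assignment of GOPs to decoder threads is purely a hypothetical scheduling policy added for illustration, since the slides do not state how GOPs are distributed.

```c
#define GOP_LENGTH 30   /* 1 I frame followed by 29 P frames */

/* Frame type at a given display/decode index in the closed-GOP layout. */
static char frame_type(int frame_index)
{
    return (frame_index % GOP_LENGTH == 0) ? 'I' : 'P';
}

/* Index of the GOP that a frame belongs to. */
static int gop_of_frame(int frame_index)
{
    return frame_index / GOP_LENGTH;
}

/* Hypothetical policy: hand GOPs to decoder threads in round-robin order. */
static int decoder_for_gop(int gop_index, int num_decoders)
{
    return gop_index % num_decoders;
}
```

Because every GOP is closed, a decoder thread needs nothing from neighbouring GOPs, so any assignment policy yields the same decoded frames; only the load balance differs.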

33
Proposed Multithreading Architecture

Get IDR Position
Main Thread
Decoder 0
Decoder 1
Decoder N
34
Multithreaded Decoder - Threads

Main Thread
  Creates all threads and semaphores
  Gets SPS and PPS NALUs from the bitstream
  Initializes multiple decoders with SPS and PPS NALUs
Get IDR Frame Position Thread
  Searches for IDR NALU positions in the bitstream
  Manages waits and releases of semaphores
Decoder Threads
  Decode H.264 GOPs

SPS: Sequence Parameter Set
PPS: Picture Parameter Set
NALU: Network Abstraction Layer Unit
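Searching for IDR, SPS and PPS NALUs can be sketched as a scan for start codes in an H.264 Annex B byte stream. The NAL unit type values (5 for an IDR slice, 7 for SPS, 8 for PPS, taken from the low five bits of the NAL header byte) come from the H.264 specification; the function name, its signature and the assumption that the input is an Annex B byte stream are illustrative choices, not details from the slides.

```c
#include <stddef.h>
#include <stdint.h>

enum { NAL_IDR_SLICE = 5, NAL_SPS = 7, NAL_PPS = 8 };

/* Returns the offset of the NAL header byte of the first NAL unit of type
 * `wanted_type` whose start code begins at or after `from`, or -1 if none. */
static long find_nal(const uint8_t *buf, size_t len, size_t from,
                     int wanted_type)
{
    for (size_t i = from; i + 3 < len; i++) {
        /* A start code ends in the byte pattern 00 00 01. */
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
            if ((buf[i + 3] & 0x1F) == wanted_type)  /* nal_unit_type */
                return (long)(i + 3);
            i += 2;  /* skip the rest of this start code */
        }
    }
    return -1;
}
```

Calling `find_nal` repeatedly with `NAL_IDR_SLICE`, each time starting just past the previous hit, yields the GOP boundaries that can be handed to the decoder threads.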
35
Multithreading Results: Speedup in Decoding Time
(Chart; x-axis: Number of Threads)
36
Multithreading Results: Threading Overhead (time in seconds)
(Chart; x-axis: No. of Threads)
37
Further Research

Optimization of the High Profile HD (720p) encoder to minimize hardware requirements
Testing of the H.264 encoder and decoder on multicore CPUs
Implementation of the time-consuming modules of the H.264 encoder and decoder on a GPU (Graphics Processing Unit)

38
References

H.264: International Telecommunication Union, "Recommendation ITU-T H.264: Advanced Video Coding for Generic Audiovisual Services," ITU-T, 2005.
MPEG-2: ISO/IEC JTC1/SC29/WG11 and ITU-T, "ISO/IEC 13818-2: Information Technology - Generic Coding of Moving Pictures and Associated Audio Information: Video," ISO/IEC and ITU-T, 1994.
Soon-kak Kwon, A. Tamhankar and K. R. Rao, "Overview of MPEG-4 Part 10."
G. Sullivan, P. Topiwala and A. Luthra, "The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions," SPIE Conference on Applications of Digital Image Processing XXVII, vol. 5558, pp. 53-74, Aug. 2004.
"The Software Optimization Cookbook," Intel Press, 2002.
"IA-32 Intel Architecture Optimization Reference Manual," www.intel.com
"Optimizing Applications with the Intel C++ and Fortran Compilers," white paper, http://developer.intel.com/design/pentium4/manuals/
J. Lee, S. Moon and W. Sun, "H.264 Decoder Optimization Exploiting SIMD Instructions," Seoul National University, http://sips03.snu.ac.kr/pub/conf/c67.pdf. Accepted at IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS), December 2004.
G. M. Amdahl, "Validity of the single-processor approach to achieving large scale computing capabilities," in AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20), AFIPS Press, Reston, Va., 1967, pp. 483-485.
Horowitz, A. Joch, F. Kossentini, and A. Hallapuro, "H.264/AVC Baseline Profile Decoder Complexity Analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 704-716, July 2003.

39
References (Continued)

http://www.blu-ray.com/
http://www.hddvd.org/hddvd/
http://www.fastvdo.com
http://www.intel.com
http://www.intel.com/software/products/vtune/
http://msdn.microsoft.com

Thanks!!
