Title: Optimization of H.264 High Profile Decoder for Pentium 4 Processor
1 Optimization of H.264 High Profile Decoder for
Pentium 4 Processor
- Tarun Bhatia
- University of Texas at Arlington
- tarun_at_fastvdo.com
2 H.264 Decoder
Video Output
Bitstream Input
Entropy Decoding
Inverse Transform and Dequantization
Deblocking
Intra/Inter Mode Selection
Picture Buffering
Intra Prediction
Motion Compensation
3 Optimization Need
- H.264/AVC video coding introduces substantially
more coding tools and coding options than earlier
standards. Achieving the highest possible coding
gain therefore costs much more computation.
- Aggressive optimization is typically required for
H.264 implementations to meet cost and power
targets and deliver real-time performance.
4 Sequences Used
Girl.264
Karate.264
Golf.264
Shore.264
Plane.264
5 H.264 Profiles
[Diagram: feature sets of the H.264 profiles]
- Profiles shown: Baseline, Main, Extended, High
- High Profile: Adaptive Block Size Transform, Perceptual Quantization Matrices
- Other feature labels: I slice, P slice, B slices, Weighted Prediction, CABAC, CAVLC, Data Partition, SP Slice, SI Slice, Arbitrary Slice Order (ASO), Flexible Macroblock Ordering (FMO), Redundant Slices
6 H.264 High Profile - Features
- Main Profile features, plus:
  - 8x8 integer DCT
  - HVS (perceptual) quantization matrices
  - 8x8 intra prediction modes
7 Optimization Levels
- Algorithm Level
  - e.g. DCT implementation
- Compiler Level
  - Microsoft Visual Studio .NET 2003 / Intel C++ Compiler v8.0
- Implementation Level
  - e.g. elimination of loops and conditions
  - Using SIMD for implementation
  - Multithreading
8 Target Platform: Pentium 4 Processor (Intel SIMD Architecture)
- 8 XMM registers, 128 bits each
- MXCSR, 32 bits
- 8 MMX registers, 64 bits each
- 8 GPRs, 32 bits each
- x87 FP register file
- EFLAGS, 32 bits
- Other diagram labels: FP MMX SSE/SSE2/SSE3; FP MOVE
- L1 data cache (8 KB, 4-way)
9 Intel HT (Hyper-Threading) Technology
- Purpose: simultaneous execution of threads
Architectural State Architectural State
Execution Engine Execution Engine
Local APIC Local APIC
Bus Interface Bus Interface
SYSTEM BUS
10 Optimization Steps
- Optimization during code development
- Optimization after code development
  - 1) Searching for hotspots in the code
  - 2) Analysis of each hotspot
    - e.g. a large number of calls, cache misses, slow implementation
  - 3) Optimization of hotspots
11 Performance Profiling
- Intel VTune(TM) Performance Analyzer
12 Intel VTune Performance Analysis - Results
(FastVDO H.264 HD High Profile Decoder)
Time consumed (% of total decoding time)
Module           Girl.264    Golf.264    Karate.264  Plane.264   Shore.264
IDCT 8x8         4.957879    0.956737    3.2735859   1.894126    1.63884
CABAC            11.026452   5.293592    10.335945   10.01807    6.407274
Memcpy / Memset  13.33369    16.86905    11.849307   11.01987    14.59611
IDCT 4x4         17.02527    20.39636    15.568315   12.89757    17.79446
MC               29.53265    38.00137    40.045078   50.16149    41.66314
Others           24.12405    18.48289    18.927766   14.00887    17.90019
13 Distribution of Decoder Time Consumption
14 SIMD
- Single Instruction, Multiple Data instructions
- Intel Pentium 4
  - MMX (Multimedia Extension), from Pentium MMX onwards
  - SSE (Streaming SIMD Extensions), from Pentium III onwards
  - SSE2 (Streaming SIMD Extensions 2), from Pentium 4 onwards
- AMD Athlon 64
  - 3DNow!
15 SIMD Data Types
[Diagram: packed data layouts in a 128-bit register]
- 16 x 8-bit
- 8 x 16-bit
- 4 x 32-bit
- 2 x 64-bit
- 1 x 128-bit
Available in XMM registers in SSE Technology
Available in MMX and XMM registers
16 SIMD Instruction Types
- Packed Arithmetic (e.g. padd, pmul)
- Packed Logical (e.g. pand, por)
- Data Movement and Memory Access (mov)
- General Support (pack, unpack)
- Packed Shift (>>, <<)
- Packed Comparison (e.g. <)
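17 Case Study
interpolation4x4(pixel_data *forward_block, pixel_data *backward_block)
{
    pixel_data *result;   // output block (allocation not shown)
    for (int i = 0; i <= 15; i++)   // 16 pixels of the 4x4 block
    {
        // rounding average of the forward and backward predictions
        result[i] = (forward_block[i] + backward_block[i] + 1) / 2;
    }
}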
18 MMX Code
interpolation(pixel_data *forward_block, pixel_data *backward_block)
{
    __asm
    {
        pxor      mm7, mm7             // set mm7 to 0
        mov       EDX, 0x01010101      // EDX = 01 01 01 01
        mov       EAX, forward_block   // store forward block starting address
        movd      mm3, EDX             // mm3 = 00 00 00 00 01 01 01 01
        mov       EBX, backward_block  // store backward block starting address
        punpcklbw mm3, mm7             // mm3 = 00 01 00 01 00 01 00 01
        mov       ECX, result          // store the address of result
        movd      mm0, [EAX]           // mm0 = fb[1..4]
        movd      mm1, [EBX]           // mm1 = bb[1..4]
        movd      mm4, [EAX+4]         // mm4 = fb[5..8]
        movd      mm5, [EBX+4]         // mm5 = bb[5..8]
        punpcklbw mm0, mm7             // widen fb[1..4] bytes to 16-bit words
        punpcklbw mm1, mm7             // widen bb[1..4] bytes to 16-bit words
        punpcklbw mm4, mm7             // widen fb[5..8] bytes to 16-bit words
        punpcklbw mm5, mm7             // widen bb[5..8] bytes to 16-bit words
        // ... (remaining steps not shown)
    }
}
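The rounding average from the case study, (a + b + 1) / 2 on unsigned bytes, can also be written with compiler intrinsics instead of inline assembly. The following is a minimal sketch, not the decoder's actual code: it assumes 8-bit pixels stored as 16 contiguous bytes per 4x4 block, and the function names `interp4x4_scalar` and `interp4x4_simd` are hypothetical. It uses the SSE2 intrinsic `_mm_avg_epu8`, which computes the same rounding average on 16 bytes at once, and falls back to the scalar loop where SSE2 is unavailable.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Scalar reference: rounding average of two 16-pixel (4x4) blocks. */
void interp4x4_scalar(const uint8_t *fwd, const uint8_t *bwd, uint8_t *out)
{
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)((fwd[i] + bwd[i] + 1) / 2);
}

/* SIMD version: one packed average replaces the 16-iteration loop. */
void interp4x4_simd(const uint8_t *fwd, const uint8_t *bwd, uint8_t *out)
{
#if HAVE_SSE2
    __m128i a = _mm_loadu_si128((const __m128i *)fwd);  /* unaligned load */
    __m128i b = _mm_loadu_si128((const __m128i *)bwd);
    /* per byte: (a + b + 1) >> 1, computed without overflow */
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(a, b));
#else
    interp4x4_scalar(fwd, bwd, out);  /* portable fallback */
#endif
}
```

Because the packed average works directly on bytes, this form does not need the unpack-to-words step that the MMX listing performs before adding.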
19 SIMD Application Results
- Amdahl's Law: the Overall Speedup (O.S.) obtained by optimizing a portion p of the program by a factor s is

\[ \text{O.S.} = \left( \frac{1}{(1 - p) + p/s} - 1 \right) \times 100 \]

- p: fraction of the code being optimized
- s: speedup factor for that fraction of code
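As a quick check of the formula, the sketch below evaluates it in C. The function name `overall_speedup_percent` is hypothetical; p is passed as a fraction (0 to 1), not a percentage.

```c
/* Overall speedup in percent when a fraction p of the run time
   is accelerated by a factor s (Amdahl's Law as stated above). */
double overall_speedup_percent(double p, double s)
{
    return (1.0 / ((1.0 - p) + p / s) - 1.0) * 100.0;
}
```

For example, with p = 0.1702527 and s = 1.932186 (the Girl.264 figures for the 4x4 IDCT), the function returns about 8.95, which agrees with the overall speedup reported for that sequence.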
20 Application to IDCT 4x4
                     Girl.264   Golf.264   Karate.264  Plane.264  Shore.264
No SIMD (%)          17.02527   20.39636   15.56831    12.89757   17.79446
SIMD (%)             8.811405   11.61393   9.16206     7.050434   9.89151
Speedup Factor       1.932186   1.756198   1.69921     1.82993    1.79896
Overall Speedup (%)  8.9489     9.628      6.8447      6.21       8.581
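To illustrate why the 4x4 inverse transform suits SIMD, here is a scalar sketch of the 4x4 inverse core transform as defined in the H.264 standard. It is not the FastVDO implementation, which the slides do not show; the function names are hypothetical and the input is assumed to be already dequantized. Each 1-D pass uses only additions, subtractions and shifts by one, so four rows or columns can be processed in parallel with packed add, subtract and shift instructions.

```c
/* One 1-D pass of the 4-point inverse core transform, in place.
   stride = 1 walks a row, stride = 4 walks a column.
   Assumes arithmetic right shift for negative values. */
static void inverse_1d(int *x, int stride)
{
    int a0 = x[0], a1 = x[stride], a2 = x[2 * stride], a3 = x[3 * stride];
    int e0 = a0 + a2;
    int e1 = a0 - a2;
    int e2 = (a1 >> 1) - a3;
    int e3 = a1 + (a3 >> 1);
    x[0]          = e0 + e3;
    x[stride]     = e1 + e2;
    x[2 * stride] = e1 - e2;
    x[3 * stride] = e0 - e3;
}

/* coeff: 16 dequantized coefficients in row-major order.
   residual: 16 reconstructed residual samples. */
void inverse_transform_4x4(const int coeff[16], int residual[16])
{
    int tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = coeff[i];
    for (int r = 0; r < 4; r++)
        inverse_1d(tmp + 4 * r, 1);          /* transform each row */
    for (int c = 0; c < 4; c++)
        inverse_1d(tmp + c, 4);              /* transform each column */
    for (int i = 0; i < 16; i++)
        residual[i] = (tmp[i] + 32) >> 6;    /* final rounding and scaling */
}
```

A block whose only non-zero coefficient is a DC value of 64 yields a residual of 1 in every position, which is a convenient sanity check.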
21 IDCT 4x4: Comparison of Time Consumed as a Share of the Total Decoding Time
22 Overall Speedup in Decoding Time with SIMD IDCT 4x4
23 Application to Motion Compensation
- The implementation of Motion Compensation can be divided into:
  - Data Manipulation (SIMD not used)
  - Interpolation (SIMD used)
    - Half-Pel Interpolation
    - Quarter-Pel Interpolation
    - Linear Interpolation for B frames
24 Motion Compensation: Time Consumption (without MMX)
25 SIMD Application to Motion Compensation - Results
                     Girl       Golf       Karate     Plane      Shore
No SIMD (%)          15.96824   13.02634   23.25319   32.54503   19.7399
SIMD (%)             9.51832    7.53874    14.32317   19.7414    11.608
Speedup Factor       1.68       1.73       1.62       1.65       1.7
Overall Speedup (%)  6.89       5.8        9.8        14.68      8.85
26 Motion Compensation Results: Comparison of Time Consumed
27 Overall Speedup in Decoding Time with SIMD MC
28 Multithreading
- Definition: Multithreading is the ability of a program to multitask within itself. The program can split itself into separate threads of execution that appear to run concurrently.
- Wait blocks a thread until a particular event hands control to it
- Release unblocks the thread
- Semaphores: locking mechanisms / counters that control access to shared resources used by multiple processes
29 Producer-Consumer Problem (Diagram)
Producer Thread
Consumer Thread
Semaphores
Wait
Serial Execution Of a Thread
Release
30 Producer-Consumer Problem (Algorithm)
- The Producer thread starts and initializes its data
- It waits for the Consumer thread
- If the Consumer thread is ready, the Producer releases control to it
- The Producer thread completes one execution cycle in the meantime and waits for the Consumer thread
- When control is passed back to the Producer thread, the process repeats until the end condition is met
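The Wait/Release handshake described above can be sketched with two counting semaphores. This is an illustrative sketch only: it uses POSIX threads and semaphores (`sem_wait` for Wait, `sem_post` for Release) rather than the Windows primitives the original decoder presumably used, and every name in it (`run_producer_consumer`, `item_ready`, `slot_free`, `NUM_ITEMS`) is hypothetical.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

#define NUM_ITEMS 8

static sem_t item_ready;   /* released by producer, waited on by consumer */
static sem_t slot_free;    /* released by consumer, waited on by producer */
static int shared_slot;    /* single-entry buffer shared by both threads */
static long consumed_sum;

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 1; i <= NUM_ITEMS; i++) {
        int item = i;              /* one execution cycle of work */
        sem_wait(&slot_free);      /* Wait: block until the consumer is ready */
        shared_slot = item;
        sem_post(&item_ready);     /* Release: hand control to the consumer */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_ITEMS; i++) {
        sem_wait(&item_ready);     /* Wait: block until an item is published */
        consumed_sum += shared_slot;
        sem_post(&slot_free);      /* Release: let the producer publish again */
    }
    return NULL;
}

/* Runs one producer and one consumer to completion; returns the sum consumed. */
long run_producer_consumer(void)
{
    pthread_t p, c;
    consumed_sum = 0;
    sem_init(&item_ready, 0, 0);   /* nothing published yet */
    sem_init(&slot_free, 0, 1);    /* the slot starts out empty */
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&item_ready);
    sem_destroy(&slot_free);
    return consumed_sum;
}
```

The producer prepares its next item before waiting, so its work overlaps with the consumer's; the end condition here is simply a fixed item count. On some toolchains this needs the `-pthread` compiler flag.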
31 Multithreading in Video Coding
- The codec can be multithreaded in two ways:
- Block Level
  - Independent blocks can be executed as separate threads, e.g. slices in H.264, motion estimation, deblocking of non-reference frames
- GOP Level
  - Closed GOP: a group of frames that uses no reference frames from outside its own GOP
  - Open GOP: a group of frames that can use reference frames from outside its GOP
32 Proposed Multithreading Architecture - Features
- GOP Level (Closed GOP)
- 30 frames per GOP
- IPPPPPPPP
- Each GOP begins with an I frame followed by P frames only (i.e. 1 I frame and 29 P frames in each GOP)
- B frames are not used in the design, to maintain the closed GOP structure
33 Proposed Multithreading Architecture
Get IDR Position
Main Thread
Decoder 0
Decoder 1
Decoder N
34 Multithreaded Decoder - Threads
- Main Thread
  - Creates all threads and semaphores
  - Gets SPS and PPS NALUs from the bitstream
  - Initializes multiple decoders with SPS and PPS NALUs
- Get IDR Frame Position Thread
  - Searches for IDR NALU positions in the bitstream
  - Manages Waits and Releases of semaphores
- Decoder Threads
  - Decode H.264 GOPs
SPS: Sequence Parameter Set; PPS: Picture Parameter Set; NALU: Network Abstraction Layer Unit
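The search for SPS, PPS and IDR NALUs can be sketched as a scan for start codes. This is a minimal sketch, not the decoder's code: it assumes the bitstream is in the H.264 Annex B byte-stream format, where each NAL unit follows a 00 00 01 start code and the low five bits of the next byte give the NAL unit type (5 = IDR slice, 7 = SPS, 8 = PPS). The function name `find_nal` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { NAL_IDR_SLICE = 5, NAL_SPS = 7, NAL_PPS = 8 };

/* Returns the offset of the header byte of the first NAL unit of type
   `wanted` whose start code begins at or after `from`, or -1 if none. */
long find_nal(const uint8_t *buf, size_t len, size_t from, int wanted)
{
    for (size_t i = from; i + 3 < len; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
            int type = buf[i + 3] & 0x1F;   /* nal_unit_type field */
            if (type == wanted)
                return (long)(i + 3);
            i += 2;                         /* skip the rest of the start code */
        }
    }
    return -1;
}
```

Calling `find_nal` repeatedly with `NAL_IDR_SLICE`, each time starting just past the previous hit, yields the byte position at which each closed GOP begins, which is what a GOP-level dispatcher needs in order to hand whole GOPs to separate decoder threads.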
35 Multithreading Results: Speedup in Decoding Time
[Chart; axis label: Number of Threads]
36 Multithreading Results: Threading Overhead (time in seconds)
[Chart; axis label: No. of Threads]
37 Further Research
- Optimization of a High Profile HD (720p) encoder to minimize hardware requirements
- Testing of the H.264 encoder and decoder on multicore CPUs
- Implementation of time-consuming modules of the H.264 encoder and decoder on a GPU (Graphics Processing Unit)
38 References
- H.264: International Telecommunication Union, "Recommendation ITU-T H.264: Advanced Video Coding for Generic Audiovisual Services," ITU-T, 2005.
- MPEG-2: ISO/IEC JTC1/SC29/WG11 and ITU-T, "ISO/IEC 13818-2: Information Technology - Generic Coding of Moving Pictures and Associated Audio Information: Video," ISO/IEC and ITU-T, 1994.
- Soon-kak Kwon, A. Tamhankar and K.R. Rao, "Overview of MPEG-4 Part 10."
- G. Sullivan, P. Topiwala and A. Luthra, "The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions," SPIE Conference on Applications of Digital Image Processing XXVII, vol. 5558, pp. 53-74, Aug. 2004.
- The Software Optimization Cookbook, Intel Press, 2002.
- IA-32 Intel Architecture Optimization, Reference Manual, www.intel.com
- "Optimization Applications with the Intel C++ and FORTRAN compilers," White paper, http://developer.intel.com/design/pentium4/manuals/
- J. Lee, S. Moon and W. Sun, "H.264 Decoder Optimization Exploiting SIMD Instructions," Seoul National University, http://sips03.snu.ac.kr/pub/conf/c67.pdf. Accepted at IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS), December 2004.
- G.M. Amdahl, "Validity of the single-processor approach to achieving large scale computing capabilities," in AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20), AFIPS Press, Reston, Va., 1967, pp. 483-485.
- M. Horowitz, A. Joch, F. Kossentini and A. Hallapuro, "H.264/AVC Baseline Profile Decoder Complexity Analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 704-716, July 2003.
39 References (Continued)
- http://www.blu-ray.com/
- http://www.hddvd.org/hddvd/
- http://www.fastvdo.com
- http://www.intel.com
- http://www.intel.com/software/products/vtune/
- http://msdn.microsoft.com