Title: Thread-Parallel MPEG-2, MPEG-4 and H.264 Video Encoders for SoC Multi-Processor Architecture
1 Thread-Parallel MPEG-2, MPEG-4 and H.264 Video
Encoders for SoC Multi-Processor Architecture
- Tom R. Jacobs, Vassilios A. Chouliaras,
- and David J. Mulvaney
IEEE Transactions on Consumer Electronics
2 Outline
- Introduction
- Background knowledge
- Main purpose
- Previous work
- Methodology
- Experimental results
- Conclusions
3 Introduction: Background Knowledge (1/5)
- A number of lossy video compression standards
have been developed.
- MPEG-1, MPEG-2, MPEG-4 Part 2, H.264
- Maintaining image quality while reducing
bit-rates costs additional computation and
power consumption.
4 Introduction: Background Knowledge (2/5)
- Such processing-intensive consumer application
algorithms are generally implemented in
System-on-Chip (SoC) devices.
- Parallelism
- DLP: Data-Level Parallelism
- TLP: Thread-Level Parallelism
5 Introduction: Background Knowledge (3/5)
- Data-Level Parallelism (DLP)
- Distributing the data across different parallel
processing nodes.
program
  if CPU = "a" then
    low_limit = 1
    upper_limit = 5
  else if CPU = "b" then
    low_limit = 6
    upper_limit = 10
  end if
  do i = low_limit, upper_limit
    Task on d(i)
  end do
  ...
end program
6 Introduction: Background Knowledge (4/5)
[Figure: data array D of size 10, elements 1 to 10, divided between two processing nodes.]
7 Introduction: Background Knowledge (5/5)
- Thread-Level Parallelism (TLP)
- TLP is the parallelism inherent in an application
that runs multiple threads at once.
- Benefit
- Distributing the workload of a single
high-performance processor among a number of
slower and simpler processor cores.
8 Introduction: Main Purpose (1/2)
- Use Thread-Level Parallel (TLP) techniques
to improve video coding performance.
- Reduce the DIC (Dynamic Instruction Count).
- How to improve?
- Distribute the workload among a number of
parallel-executing processors.
9 Introduction: Main Purpose (2/2)
- The results presented demonstrate that reductions
in dynamic instruction count can be achieved.
10 Previous Work
- The majority of previous research focuses on
coarse-granularity TLP exploitation, with the
workload most commonly distributed at the GOP
(group of pictures) level.
[Figure: multi-threading at GOP level, with GOPs distributed across threads and little inter-node communication.]
11 Previous Work
- In 1995, K. Shen, L. A. Rowe, and E. J. Delp
implemented parallel MPEG-1 at the GOP level.
- In 1996, S. Bozoki, S. J. P. Westen, R. L.
Lagendijk and J. Biemond compared GOP-level and
slice-level parallelism on MPEG-1.
12 Previous Work
- In 1997, A. Bilas, J. Fritts and J. P. Singh
evaluated the performance of MPEG-2 decoders
on a shared-memory system.
- Akramullah, Ahmad and Liou implemented a threaded
MPEG-2 encoder at the MB level using local
memory.
13 Methodology: Overview
- The threaded MPEG-2, MPEG-4 and H.264 encoders
were compiled for a multi-context instruction
set simulator (MT-ISS) based on the
SimpleScalar infrastructure.
- The most important issues
- Data dependencies between processors.
- Avoiding race hazards.
14 Methodology: Race hazards
[Figure: race hazard on a shared integer i that Thread 1 and Thread 2 each increment (i+1). Expected condition: i goes 0, 1, 2. Error condition: both threads read 0 and both write 1, so i ends at 1 instead of 2.]
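A race hazard on a shared integer can be avoided by making the read, increment and write one indivisible step. The sketch below is a minimal Python illustration, assuming a mutual-exclusion lock as the protection mechanism; the slides do not say which primitive the encoders use, and the names `SharedCounter` and `run_threads` are hypothetical.

```python
import threading


class SharedCounter:
    """Shared integer i guarded by a lock (illustrative, not from the paper)."""

    def __init__(self):
        self.i = 0
        self._lock = threading.Lock()

    def increment(self):
        # Read, add and write form one critical section, so two threads
        # can never both start from the same old value of i.
        with self._lock:
            self.i = self.i + 1


def run_threads(num_threads, increments_per_thread):
    """Let several threads increment one counter and return its final value."""
    counter = SharedCounter()

    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.i
```

With two threads and one increment each, `run_threads(2, 1)` corresponds to the expected condition, where i ends at 2.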
15 Methodology: Thread-parallel MPEG-2 (1/5)
- Test Model 5 (TM5) of the MPEG-2 encoder is used.
- Computation analysis (QCIF)
- DIST1: 52% to 73% of total DIC for search windows
of 6 to 62 pels, respectively.
- FullSearch: 3.5% to 23.2% of total DIC.
- Can be improved by a less complex ME
algorithm (such as 3-step, 4-step or diamond search).
- FDCT and IDCT: 2.1% to 21% of total DIC.
16 Methodology: Thread-parallel MPEG-2 (2/5)
17 Methodology: Thread-parallel MPEG-2 (3/5)
- Motion Estimation
- The kernel implementation can take advantage of
data-parallel techniques.
- Store the results in the mbinfo structure for
motion compensation.
- Maintain exclusivity of all variables during the
parallel sections.
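One way to keep variables exclusive during a parallel motion estimation pass is to give every macroblock its own result slot and treat the frames as read-only. The sketch below assumes a plain full search with a sum-of-absolute-differences cost, a 4-pel block size for brevity, and a Python list standing in for the mbinfo structure; none of these details are taken from TM5, and all function names are hypothetical. In CPython the thread pool shows the decomposition rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

MB = 4  # macroblock size in pels (real encoders use 16; kept small here)


def sad(cur, ref, cx, cy, rx, ry):
    """Sum of absolute differences between an MB in cur and a candidate in ref."""
    return sum(abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
               for y in range(MB) for x in range(MB))


def full_search(cur, ref, cx, cy, search_range):
    """Exhaustive search: return (best_sad, (dx, dy)) for the MB at (cx, cy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if 0 <= rx <= width - MB and 0 <= ry <= height - MB:
                cost = sad(cur, ref, cx, cy, rx, ry)
                if best is None or cost < best[0]:
                    best = (cost, (dx, dy))
    return best


def motion_estimation(cur, ref, search_range=2, workers=4):
    """Estimate one motion vector per MB, one task per MB."""
    positions = [(x, y) for y in range(0, len(cur), MB)
                 for x in range(0, len(cur[0]), MB)]
    mbinfo = [None] * len(positions)  # one private slot per MB

    def estimate(index):
        cx, cy = positions[index]
        # Each task writes only mbinfo[index]; cur and ref are only read.
        mbinfo[index] = full_search(cur, ref, cx, cy, search_range)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(estimate, range(len(positions))))
    return mbinfo
```

Because no two tasks share a writable variable, no lock is needed inside the parallel section.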
18 Methodology: Thread-parallel MPEG-2 (4/5)
- Forward transform
- The FDCT stage scans the MBs row by row and
processes the MBs in each row individually.
- It determines the prediction error and applies
the DCT to each block.
- The thread-parallel transform function can be
performed at block level.
19 Methodology: Thread-parallel MPEG-2 (5/5)
- Inverse transform
- The IDCT stage scans the MBs first row by row and
then block by block.
- Because there are no data dependencies between
blocks, the blocks can be processed in parallel.
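Block-level independence of the forward and inverse transforms can be illustrated with a small sketch. It assumes an orthonormal 8x8 DCT-II written in plain Python and a thread pool that handles one block per task; `fdct`, `idct` and `transform_blocks` are hypothetical names, not the TM5 routines.

```python
import math
from concurrent.futures import ThreadPoolExecutor

N = 8  # DCT block size


def _scale(k):
    """Normalisation factor of the orthonormal DCT-II."""
    return math.sqrt((1.0 if k == 0 else 2.0) / N)


# BASIS[k][n]: k-th DCT basis function sampled at position n.
BASIS = [[_scale(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N))
          for n in range(N)] for k in range(N)]


def fdct(block):
    """2-D forward DCT of one N x N block (rows first, then columns)."""
    tmp = [[sum(BASIS[l][n] * block[y][n] for n in range(N)) for l in range(N)]
           for y in range(N)]
    return [[sum(BASIS[k][y] * tmp[y][l] for y in range(N)) for l in range(N)]
            for k in range(N)]


def idct(coef):
    """2-D inverse DCT of one N x N coefficient block."""
    tmp = [[sum(BASIS[l][n] * coef[k][l] for l in range(N)) for n in range(N)]
           for k in range(N)]
    return [[sum(BASIS[k][y] * tmp[k][n] for k in range(N)) for n in range(N)]
            for y in range(N)]


def transform_blocks(blocks, transform, workers=4):
    """Apply transform to every block; blocks share no data, so order is free."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, blocks))
```

Calling `transform_blocks(blocks, fdct)` followed by `transform_blocks(coefs, idct)` recovers the input blocks up to floating-point error, whichever thread handles which block.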
20 Methodology: Thread-parallel MPEG-4 (1/8)
- The implementation is based on the XviD project
with the Advanced Simple Profile (ASP).
- Bidirectional frames
- Quarter-pel motion compensation
- Global motion compensation
- Trellis quantization
- Custom quantization matrices
21 Methodology: Thread-parallel MPEG-4 (2/8)
- Computation analysis (QCIF)
22 Methodology: Thread-parallel MPEG-4 (3/8)
- The nature of the XviD encoder
- Intra-frame encoding
- Inter-frame encoding
23 Methodology: Thread-parallel MPEG-4 (4/8)
- Intra-frame encoding
- FrameCodeI (row by row over the MBs)
- Parallelize the loop that encodes the MBs in a
row of the image.
- MB data structure: pMB.
- Shared memory array.
- The function with the highest DIC in FrameCodeI is
MBTransQuantIntra.
24 Methodology: Thread-parallel MPEG-4 (5/8)
- MBTransQuantIntra
- Forward transformation, quantization and inverse
transformation.
- Shared data structure: pEnc
- Includes a count of quantization values.
- Serial code section.
- Transforms the pixel data of a specific MB into
the frequency domain independently.
- MBPrediction and MBCoding
- Responsible for VLC and writing to the bitstream.
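The split between independent per-MB work and a serial section for shared encoder state can be sketched as follows. This is only an illustration under stated assumptions: `trans_quant` is a stand-in that divides by the quantizer instead of running a real transform, the shared state is modelled as a dictionary with a per-quantizer usage counter, and the bitstream is a plain list. None of this reflects the actual XviD data layout.

```python
from concurrent.futures import ThreadPoolExecutor


def trans_quant(mb_pixels, quant):
    """Stand-in for the per-MB transform and quantization step."""
    return [p // quant for p in mb_pixels]


def encode_row(row_of_mbs, quant, enc_state, bitstream, workers=4):
    """Encode one row of MBs: parallel transform, then serial bookkeeping."""
    # Parallel section: every MB is processed independently of the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        coeffs = list(pool.map(lambda mb: trans_quant(mb, quant), row_of_mbs))

    # Serial section: the shared counter and the bitstream are touched by
    # one thread only, in raster order, so no locking is required.
    for mb_coeffs in coeffs:
        counts = enc_state["quant_count"]
        counts[quant] = counts.get(quant, 0) + 1
        bitstream.extend(mb_coeffs)
```

Keeping the counter update and the bitstream write outside the parallel section trades a little serial time for freedom from race hazards on the shared structure.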
25 Methodology: Thread-parallel MPEG-4 (6/8)
- Inter-frame encoding
- FrameCodeP
- Part 1: Motion Estimation
- Part 2: Transformation, Quantization, MC
26 Methodology: Thread-parallel MPEG-4 (7/8)
- Motion Estimation
- Determines an MV for every MB and applies certain
criteria to indicate when Intra coding should be
used.
- Scans in raster-line order.
- Two kinds of process
- Motion prediction from the current frame.
- ME relative to reference frames.
27 Methodology: Thread-parallel MPEG-4 (8/8)
- Motion Prediction
- Examines the MVs of neighbouring MBs to
determine an initial estimate for ME.
[Figure: prediction patterns compared: ideal pattern, typical pattern, TLP pattern.]
28 Methodology: H.264 (1/6)
- x264 is used for the implementation.
- Frame slicing
- Main problems with MB-level parallelism
- Wide variation in processor workload.
- The prediction algorithm would need to be
modified.
29 Methodology: H.264 (2/6)
- Slice group in H.264
- A group of MBs in a frame.
- Can be encoded or decoded separately from the
remainder of the frame.
- Motion prediction across slice boundaries is not
allowed.
- Drawback
- The required bit-rate increases.
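Frame slicing and the ban on prediction across slice boundaries can be sketched in a few lines. The example assumes slices made of contiguous macroblock rows of near-equal size and considers only the left and top neighbours as prediction candidates; both are simplifications for illustration, and `partition_rows` and `usable_neighbours` are hypothetical helpers, not x264 functions.

```python
def partition_rows(mb_rows, num_slices):
    """Split mb_rows macroblock rows into num_slices contiguous row ranges."""
    base, extra = divmod(mb_rows, num_slices)
    slices, start = [], 0
    for s in range(num_slices):
        # The first `extra` slices take one additional row each.
        size = base + (1 if s < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


def usable_neighbours(mb_x, mb_y, rows):
    """Left and top neighbours of an MB that lie inside the same slice."""
    candidates = [(mb_x - 1, mb_y), (mb_x, mb_y - 1)]
    return [(x, y) for x, y in candidates if x >= 0 and y in rows]
```

A QCIF frame (176x144 pels) has 9 rows of 16x16 macroblocks, so `partition_rows(9, 4)` yields slices of 3, 2, 2 and 2 rows. An MB in the first row of a slice loses its top neighbour as a predictor, which is one reason the bit-rate rises as the slice count grows.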
30 Methodology: H.264 (3/6)
- Comparison of different numbers of slices
31 Methodology: H.264 (4/6)
- Comparison of different numbers of slices
32 Methodology: H.264 (5/6)
- Different resolutions with 4 slices
33 Methodology: H.264 (6/6)
34 Experimental Results: MPEG-2
[Figure; axis: Search Range]
35 Experimental Results: MPEG-4
[Figure; axis: Quality Setting]
36 Experimental Results: H.264
[Figure; axis: Quantization Parameter]
37 Experimental Results: Comparative results
38 Conclusions
- The DIC of the MPEG-2, MPEG-4 and H.264 encoders
can be greatly reduced by TLP.
- For HD sequences, the improvement is around 84%,
92% and 96%, respectively.
- TLP has become more significant with each new
generation of video encoders.