1
Native Signal Processing and MMX (2) Programming
2
Programming Environment
  • Eight 64-bit MMX registers.
  • MM0–MM7 are aliased onto the first 64 bits of the
    80-bit floating-point (FP) registers.
  • Four MMX data types (see the sketch below):
  • packed bytes (8 x 8-bit).
  • packed words (4 x 16-bit).
  • packed doublewords (2 x 32-bit).
  • quadword (1 x 64-bit).
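For illustration, a minimal C sketch of the four packed views of one 64-bit MMX quantity; the union and its name are ours, not an Intel type:

    #include <stdint.h>
    #include <stdio.h>

    /* One 64-bit MMX quantity can be viewed as any of the four packed types. */
    typedef union {
        uint64_t quadword;    /* 1 x 64-bit */
        uint32_t dwords[2];   /* 2 x 32-bit */
        uint16_t words[4];    /* 4 x 16-bit */
        uint8_t  bytes[8];    /* 8 x 8-bit  */
    } mmx_value;              /* illustrative name only */

    int main(void) {
        mmx_value v;
        v.quadword = 0x0102030405060708ULL;
        printf("%u %u %u %u\n", v.words[0], v.words[1], v.words[2], v.words[3]);
        return 0;
    }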

3
Programming Environment (Cont.)
  • Key attributes of MMX instructions (a short intrinsics
    sketch follows below).
  • All instructions operate on integers.
  • Only data-transfer instructions can have a memory
    operand as the destination.
  • All non-data-transfer instructions must have an
    MMX register as the destination; the source can
    be either an MMX register or a memory location.
  • The mnemonic of every non-data-transfer
    instruction is prefixed with P for packed.
  • MMX instructions do NOT set flags.
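A minimal sketch of these rules using the MMX intrinsics from <mmintrin.h> (an MMX-capable compiler and target are assumed; the arithmetic intrinsics correspond to P-prefixed packed instructions):

    #include <mmintrin.h>   /* MMX intrinsics */
    #include <stdio.h>

    int main(void) {
        /* PADDW: the destination is an MMX register; the source may be
           another MMX register or memory.  No flags are set by the add. */
        __m64 a   = _mm_set_pi16(4, 3, 2, 1);      /* four packed 16-bit words */
        __m64 b   = _mm_set_pi16(40, 30, 20, 10);
        __m64 sum = _mm_add_pi16(a, b);

        union { __m64 v; short s[4]; } out;
        out.v = sum;          /* a data transfer (MOVQ) writes the result out */
        _mm_empty();          /* EMMS: release the aliased FP register state  */

        printf("%d %d %d %d\n", out.s[0], out.s[1], out.s[2], out.s[3]);
        return 0;
    }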

4
Streaming SIMD Extensions (SSE)
  • SSE: since the Pentium III.
  • Physically adds eight new 128-bit XMM registers
    and 70 new instructions; new machine state is
    introduced.
  • Supports four 32-bit single-precision floating-point
    operations in parallel (sketched below). Recall that
    all MMX SIMD instructions operate on integers only.
  • SSE2: since the Pentium 4.
  • Uses the XMM registers; no new machine state.
  • 144 new instructions added.
  • Supports parallel double-precision floating-point
    operations.
  • Itanium (IA-64) processor.
  • Enable, enhance, express, and exploit parallelism at
    the process/thread level for programmers and at the
    instruction level for compilers, all explicitly.
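A minimal sketch of the four-wide single-precision parallelism that SSE adds, using the <xmmintrin.h> intrinsics (an SSE-capable compiler and CPU are assumed):

    #include <xmmintrin.h>   /* SSE intrinsics (Pentium III and later) */
    #include <stdio.h>

    int main(void) {
        /* ADDPS: four single-precision additions in one instruction. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(0.4f, 0.3f, 0.2f, 0.1f);
        __m128 c = _mm_add_ps(a, b);

        float out[4];
        _mm_storeu_ps(out, c);          /* unaligned store of the 4 results */
        printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }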

5
Basic Principles
  • Vectorize the operations, i.e., put multiple
    identical operations into one instruction.
  • Use the smallest data type that suffices, to enable
    more parallelism (see the sketch below).
  • Arrange data into the right format for parallel
    execution and memory access.
  • Reduce shuffling and maximize pairing.
  • Do not over-optimize the algorithm in terms of
    the number of scalar operations.
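For instance, with packed bytes one MMX add processes eight elements per instruction instead of four; a minimal sketch using the saturating byte add from <mmintrin.h> (MMX support assumed, example values are ours):

    #include <mmintrin.h>
    #include <stdio.h>

    int main(void) {
        /* PADDUSB: eight 8-bit pixels brightened per instruction,
           saturating at 255 instead of wrapping around. */
        __m64 pixels = _mm_set_pi8(10, 20, 30, 40, 50, 60, 70, 80);
        __m64 bias   = _mm_set1_pi8(100);
        union { __m64 v; unsigned char b[8]; } out;

        out.v = _mm_adds_pu8(pixels, bias);
        _mm_empty();

        for (int i = 0; i < 8; ++i)
            printf("%u ", out.b[i]);
        printf("\n");
        return 0;
    }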

6
Matching the Algorithms to MMX Instruction
Capabilities
  • Usually, multiplications are the expensive operations
    in an implementation. In MMX, however, packed
    multiplications are cheap, and an algorithm with
    more multiplications can be better than another
    (see the PMADDWD sketch below).
  • MMX offers its best support for the 8-bit and 16-bit
    integer data types. Some signal-processing
    applications are harder to map because of their
    higher-precision requirements.
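As a concrete illustration of cheap packed multiplication, PMADDWD performs four 16x16-bit multiplies plus two 32-bit additions in a single instruction; a minimal sketch with <mmintrin.h> (MMX support assumed):

    #include <mmintrin.h>
    #include <stdio.h>

    int main(void) {
        /* PMADDWD: (a0*b0 + a1*b1) and (a2*b2 + a3*b3) as two 32-bit sums. */
        __m64 a = _mm_set_pi16(4, 3, 2, 1);   /* words, low to high: 1 2 3 4 */
        __m64 b = _mm_set_pi16(8, 7, 6, 5);   /* words, low to high: 5 6 7 8 */
        union { __m64 v; int d[2]; } out;

        out.v = _mm_madd_pi16(a, b);          /* 4 multiplies + 2 adds, one op */
        _mm_empty();

        printf("%d %d\n", out.d[0], out.d[1]);   /* 1*5+2*6 = 17, 3*7+4*8 = 53 */
        return 0;
    }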

7
MMX Instruction-level Code Optimization
  • Do not intermix MMX instructions and
    floating-point instructions; they do not mix well
    (a sketch of switching between the two follows below).
  • MMX instructions that reference memory or
    integer registers do not mix well with integer
    instructions referencing the same memory or
    registers.
  • Arrange data properly. Columnwise processing is
    better than sequential rowwise processing.
  • For the Pentium III or earlier Intel architectures:
  • MMX shift/pack/unpack instructions do not mix
    well with each other. Only one MMX
    shift/pack/unpack instruction can be executed per
    clock because there is only one shifter unit.
  • MMX multiplications (PMULLW/PMULHW/PMADDWD) do not
    mix well with each other.
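The usual way to keep MMX and x87 floating-point code apart is to end every MMX section with EMMS before any floating-point instruction runs; a minimal sketch (the helper function is ours, <mmintrin.h> assumed):

    #include <mmintrin.h>
    #include <math.h>
    #include <stdio.h>

    /* Illustrative helper: an MMX-only section that ends with EMMS. */
    static int packed_sum16(void) {
        __m64 a = _mm_set_pi16(1, 2, 3, 4);
        __m64 s = _mm_madd_pi16(a, _mm_set1_pi16(1));   /* sum pairs of words */
        union { __m64 v; int d[2]; } out;
        out.v = s;
        _mm_empty();       /* EMMS: release the FP registers aliased by MMX */
        return out.d[0] + out.d[1];
    }

    int main(void) {
        int s = packed_sum16();        /* MMX work, then EMMS            */
        double r = sqrt((double)s);    /* floating point only after EMMS */
        printf("%d %f\n", s, r);
        return 0;
    }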

8
Code Optimization Guidelines
  • Understand where the application spends most of
    its execution time.
  • Understand which algorithm is best for MMX
    technology in this application.
  • Understand where data values in the application can
    be converted to integers while maintaining the
    required range and precision.
  • Use a current-generation compiler that will
    produce an optimized application.
  • VC6, Delphi 6, BCB6, the Intel compiler, PGCC, etc.

9
Code Optimization Guidelines (Cont.)
  • Maximize memory-access performance:
  • Minimize memory use; maximize register usage.
  • Prefetch data and make sure all data are aligned.
  • Align frequently executed branch targets on
    16-byte boundaries.
  • Avoid partial register stalls.
  • Load and store data to the same area of memory
    using the same data sizes and address alignments.
  • Minimize branching penalties:
  • Minimize branch instructions, for instance by
    unrolling small loops.
  • Arrange code to minimize mispredictions by the
    branch-prediction algorithm.
  • Use software pipelining to schedule latencies and
    functional units.

10
Vector-Matrix Multiplication
  • C Code (original)
    int16 vect[Y_SIZE];
    int16 matr[Y_SIZE][X_SIZE];
    int16 result[X_SIZE];
    int32 accum;
    for (i = 0; i < X_SIZE; i++) {
      accum = 0;
      for (j = 0; j < Y_SIZE; j++)
        accum += vect[j] * matr[j][i];
      result[i] = accum;
    }

11
Matching Data Structure to MMX Instructions
  • With the MMX data formats we can handle a 1x2 slice of
    the vector times a 2x4 block of the matrix at a time
    (a scalar sketch of this block product follows below).
  • Hence the goal is to partition the matrix into
    2x4 blocks.
  • The L1 cache line is 256 bits wide, so one wants to
    use most of the cached data to avoid reloading the
    same data multiple times.
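To make the partitioning concrete, here is a plain-C sketch of the block product that one MULT4x2 step computes (a 1x2 slice of the vector times a 2x4 block of the matrix, accumulated into four 32-bit sums); the function and variable names are ours:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative scalar version of one MULT4x2 step: rows j and j+1 of the
       matrix, columns i..i+3, multiplied by vect[j], vect[j+1] and accumulated. */
    static void mult4x2_scalar(const int16_t *vect, int j,
                               const int16_t *matr, int x_size, int i,
                               int32_t accum[4]) {
        for (int c = 0; c < 4; ++c) {
            accum[c] += (int32_t)vect[j]     * matr[ j      * x_size + i + c]
                      + (int32_t)vect[j + 1] * matr[(j + 1) * x_size + i + c];
        }
    }

    int main(void) {
        int16_t vect[2]    = {1, 2};
        int16_t matr[2][4] = {{1, 2, 3, 4},    /* row j   */
                              {5, 6, 7, 8}};   /* row j+1 */
        int32_t accum[4]   = {0, 0, 0, 0};
        mult4x2_scalar(vect, 0, &matr[0][0], 4, 0, accum);
        printf("%d %d %d %d\n", accum[0], accum[1], accum[2], accum[3]); /* 11 14 17 20 */
        return 0;
    }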

12
Modified C Code
    int16 vect[Y_SIZE];
    int16 matr[Y_SIZE][X_SIZE];
    int16 result[X_SIZE];
    int32 accum[4];
    for (i = 0; i < X_SIZE; i += 4) {
      accum = {0, 0, 0, 0};
      for (j = 0; j < Y_SIZE; j += 2)
        accum += MULT4x2(vect[j], matr[j][i]);
      result[i..i+3] = accum;
    }
  • The C code is modified to the MMX format, so the
    accumulator is now a 1x4 array.

13
Notation and Memory Map
  • Each element is a 2-byte word, occupying 2
    addresses.
  • The first element of the 1x2 vector, v1, is pointed
    to by register esi.
  • The first element of the first row is pointed to by
    edx.
  • The number of columns is stored in ecx. Hence a21's
    address is edx + 2*ecx (a worked example follows below).
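A worked example of that address arithmetic, assuming a row-major int16 layout (the variable names mirror the registers and are illustrative only):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int16_t matr[2][4] = {{11, 12, 13, 14},   /* a11 a12 a13 a14 */
                              {21, 22, 23, 24}};  /* a21 a22 a23 a24 */
        const char *edx = (const char *)&matr[0][0]; /* base: first element of row 1 */
        long ecx = 4;                                /* number of columns            */

        /* a21 lives one full row past the base: 2*ecx bytes = one row of words. */
        const int16_t *a21 = (const int16_t *)(edx + 2 * ecx);
        printf("a21 = %d\n", *a21);   /* prints 21 */
        return 0;
    }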

14
MULT4x2 MMX Loop Body
    movd      mm7, [esi]           ; load two elements from the input vector
    punpckldq mm7, mm7             ; duplicate the input vector: v1|v2|v1|v2
    movq      mm6, [edx]           ; load first line of matrix (4 elements)
    movq      mm0, [edx + 2*ecx]   ; load second line of matrix (4 elements)
    movq      mm1, mm0             ; transpose matrix to column presentation
    punpcklwd mm0, mm6             ; mm0 keeps columns 0 and 1
    punpckhwd mm1, mm6             ; mm1 keeps columns 2 and 3
    pmaddwd   mm0, mm7             ; multiply and add the 1st and 2nd columns
    pmaddwd   mm1, mm7             ; multiply and add the 3rd and 4th columns
    paddd     mm2, mm0             ; accumulate 32-bit results for columns 0/1
    paddd     mm3, mm1             ; accumulate 32-bit results for columns 2/3
    packssdw  mm2, mm2             ; pack the results for columns 0 and 1 to 16 bits
    packssdw  mm3, mm3             ; pack the results for columns 2 and 3 to 16 bits
    punpckldq mm2, mm3             ; all four 16-bit results in one register (mm2)
    movq      [edi], mm2           ; store the four results to the output
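For readers more comfortable with C, roughly the same 4x2 step can be written with MMX intrinsics. The sketch below is ours, not a literal transcription of the assembly above: the operand order is chosen so that the scalar result accum[c] += vect[j]*row0[c] + vect[j+1]*row1[c] is reproduced, and one PACKSSDW suffices because the intrinsic form packs both accumulators at once.

    #include <mmintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    /* One 4x2 step: (v1, v2) times a 2x4 matrix block, accumulated into
       two pairs of 32-bit sums (acc01 for columns 0/1, acc23 for 2/3). */
    static void mult4x2_intrin(__m64 v12, __m64 r0, __m64 r1,
                               __m64 *acc01, __m64 *acc23) {
        __m64 v   = _mm_unpacklo_pi32(v12, v12);              /* PUNPCKLDQ: v1|v2|v1|v2 */
        __m64 c01 = _mm_unpacklo_pi16(r0, r1);                /* PUNPCKLWD: columns 0,1 */
        __m64 c23 = _mm_unpackhi_pi16(r0, r1);                /* PUNPCKHWD: columns 2,3 */
        *acc01 = _mm_add_pi32(*acc01, _mm_madd_pi16(c01, v)); /* PMADDWD + PADDD        */
        *acc23 = _mm_add_pi32(*acc23, _mm_madd_pi16(c23, v));
    }

    int main(void) {
        __m64 v12 = _mm_set_pi16(0, 0, 2, 1);       /* v1 = 1, v2 = 2 in the low words */
        __m64 r0  = _mm_set_pi16(4, 3, 2, 1);       /* row j:   1 2 3 4                */
        __m64 r1  = _mm_set_pi16(8, 7, 6, 5);       /* row j+1: 5 6 7 8                */
        __m64 acc01 = _mm_setzero_si64(), acc23 = _mm_setzero_si64();
        union { __m64 v; int16_t s[4]; } result;

        mult4x2_intrin(v12, r0, r1, &acc01, &acc23);
        result.v = _mm_packs_pi32(acc01, acc23);    /* PACKSSDW: back to 16-bit words  */
        _mm_empty();
        printf("%d %d %d %d\n", result.s[0], result.s[1], result.s[2], result.s[3]);
        /* expected: 11 14 17 20, matching the scalar sketch earlier */
        return 0;
    }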

15
Trace the program
Register contents after the execution of each instruction:
  • MOVD mm7, [esi]            mm7 = XX | XX | v1 | v2
  • PUNPCKLDQ mm7, mm7         mm7 = v1 | v2 | v1 | v2
    (Note: DQ means v1 and v2 are moved together as one doubleword.)
  • MOVQ mm6, [edx]            mm6 = a11 | a12 | a13 | a14
  • MOVQ mm0, [edx + 2*ecx]    mm0 = a21 | a22 | a23 | a24
  • MOVQ mm1, mm0              mm1 = a21 | a22 | a23 | a24
  • PUNPCKLWD mm0, mm6         mm0 = a13 | a23 | a14 | a24
16
Trace
  • PUNPCKHWD mm1, mm6         mm1 = a11 | a21 | a12 | a22
  • PMADDWD mm0, mm7           mm0 = v1*a13 + v2*a23 | v1*a14 + v2*a24
    (Overflow may occur here, but only wrap-around is available.)
  • PMADDWD mm1, mm7           mm1 = v1*a11 + v2*a21 | v1*a12 + v2*a22
    (mm7 still holds v1 | v2 | v1 | v2.)
17
Trace
  • PADDD mm2, mm0             mm2 = v1*a13 + v2*a23 | v1*a14 + v2*a24
    (mm2 and mm3 are accumulators, so the results can be
    reused later; wrap-around overflow control.)
  • PADDD mm3, mm1             mm3 = v1*a11 + v2*a21 | v1*a12 + v2*a22
  • PACKSSDW mm2, mm2          mm2 = b3 | b4 | b3 | b4
  • PACKSSDW mm3, mm3          mm3 = b1 | b2 | b1 | b2
  • PUNPCKLDQ mm2, mm3         mm2 = b1 | b2 | b3 | b4
  • MOVQ [edi], mm2            store the four 16-bit results to the output
18
Loop Unrolling
    int16 vect[Y_SIZE];
    int16 matr[Y_SIZE][X_SIZE];
    int16 result[X_SIZE];
    int32 accum[16];
    for (i = 0; i < X_SIZE; i += 16) {
      accum = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
      for (j = 0; j < Y_SIZE; j += 2) {
        accum[0..3]   += MULT4x2(vect[j], matr[j][i]);
        accum[4..7]   += MULT4x2(vect[j], matr[j][i+4]);
        accum[8..11]  += MULT4x2(vect[j], matr[j][i+8]);
        accum[12..15] += MULT4x2(vect[j], matr[j][i+12]);
      }
      result[i..i+15] = accum;
    }

19
Register Assignments
  • The four instances of the MULT4x2 routine in the
    unrolled code will be assigned to different
    registers to hold the temporary results.

20
Instruction Scheduling
21
Useful URLs (may need to be updated)
  • http://www.intel.com/design/mmx/manuals/
  • Intel MMX manuals, application notes, etc.
  • http://support.intel.com/support/performancetools/
  • Intel Performance Tools.
  • http://support.intel.com/support/performancetools/libraries/
  • Intel Performance Library Suite.
  • http://coolgraphix.homepage.com/
  • Software with JPEG compression using MMX by TY.
  • http://www.zdnet.com/pcmag/features/mmx/mmx-s4.htm

22
References
  • Intel Corp., David Bistry et al., The Complete
    Guide to MMX Technology, McGraw-Hill, Inc., 1997.