1
Native Signal Processing and MMX (2) Programming
2
Programming Environment
  • Eight 64-bit MMX registers.
  • MM0–MM7 are aliased onto the first 64 bits of the
    80-bit floating-point (FP) registers.
  • Four MMX data types (see the sketch below):
  • packed bytes (8 x 8-bit).
  • packed words (4 x 16-bit).
  • packed doublewords (2 x 32-bit).
  • quadword (1 x 64-bit).
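For illustration, a minimal C sketch of the four packed views of one 64-bit MMX quantity; the union and its name are ours, not an Intel type:

    #include <stdint.h>
    #include <stdio.h>

    /* One 64-bit MMX quantity can be viewed as any of the four packed types. */
    typedef union {
        uint64_t quadword;    /* 1 x 64-bit */
        uint32_t dwords[2];   /* 2 x 32-bit */
        uint16_t words[4];    /* 4 x 16-bit */
        uint8_t  bytes[8];    /* 8 x 8-bit  */
    } mmx_value;              /* illustrative name only */

    int main(void) {
        mmx_value v;
        v.quadword = 0x0102030405060708ULL;
        printf("%u %u %u %u\n", v.words[0], v.words[1], v.words[2], v.words[3]);
        return 0;
    }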

3
Programming Environment (Cont.)
  • Key attributes of MMX instructions (a short intrinsics
    sketch follows below).
  • All instructions operate on integers.
  • Only data-transfer instructions can have a memory
    operand as the destination.
  • All non-data-transfer instructions must have an
    MMX register as the destination; the source can
    be either an MMX register or a memory location.
  • The mnemonic of every non-data-transfer
    instruction is prefixed with P for packed.
  • MMX instructions do NOT set flags.
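A minimal sketch of these rules using the MMX intrinsics from <mmintrin.h> (an MMX-capable compiler and target are assumed; the arithmetic intrinsics correspond to P-prefixed packed instructions):

    #include <mmintrin.h>   /* MMX intrinsics */
    #include <stdio.h>

    int main(void) {
        /* PADDW: the destination is an MMX register; the source may be
           another MMX register or memory.  No flags are set by the add. */
        __m64 a   = _mm_set_pi16(4, 3, 2, 1);      /* four packed 16-bit words */
        __m64 b   = _mm_set_pi16(40, 30, 20, 10);
        __m64 sum = _mm_add_pi16(a, b);

        union { __m64 v; short s[4]; } out;
        out.v = sum;          /* a data transfer (MOVQ) writes the result out */
        _mm_empty();          /* EMMS: release the aliased FP register state  */

        printf("%d %d %d %d\n", out.s[0], out.s[1], out.s[2], out.s[3]);
        return 0;
    }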

4
Streaming SIMD Extensions (SSE)
  • SSE: since the Pentium III.
  • Physically adds eight new 128-bit XMM registers
    and 70 new instructions; new machine state is
    introduced.
  • Supports four 32-bit single-precision floating-point
    operations in parallel (sketched below). Recall that
    all MMX SIMD instructions operate on integers only.
  • SSE2: since the Pentium 4.
  • Uses the XMM registers; no new machine state.
  • 144 new instructions added.
  • Supports parallel double-precision floating-point
    operations.
  • Itanium (IA-64) processor.
  • Enable, enhance, express, and exploit parallelism at
    the process/thread level for programmers and at the
    instruction level for compilers, all explicitly.
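A minimal sketch of the four-wide single-precision parallelism that SSE adds, using the <xmmintrin.h> intrinsics (an SSE-capable compiler and CPU are assumed):

    #include <xmmintrin.h>   /* SSE intrinsics (Pentium III and later) */
    #include <stdio.h>

    int main(void) {
        /* ADDPS: four single-precision additions in one instruction. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(0.4f, 0.3f, 0.2f, 0.1f);
        __m128 c = _mm_add_ps(a, b);

        float out[4];
        _mm_storeu_ps(out, c);          /* unaligned store of the 4 results */
        printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }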

5
Basic Principles
  • Vectorize the operations, i.e., put multiple
    identical operations into one instruction.
  • Use the smallest data type that suffices, to enable
    more parallelism (see the sketch below).
  • Arrange data into the right format for parallel
    execution and memory access.
  • Reduce shuffling and maximize pairing.
  • Do not over-optimize the algorithm in terms of
    the number of scalar operations.
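For instance, with packed bytes one MMX add processes eight elements per instruction instead of four; a minimal sketch using the saturating byte add from <mmintrin.h> (MMX support assumed, example values are ours):

    #include <mmintrin.h>
    #include <stdio.h>

    int main(void) {
        /* PADDUSB: eight 8-bit pixels brightened per instruction,
           saturating at 255 instead of wrapping around. */
        __m64 pixels = _mm_set_pi8(10, 20, 30, 40, 50, 60, 70, 80);
        __m64 bias   = _mm_set1_pi8(100);
        union { __m64 v; unsigned char b[8]; } out;

        out.v = _mm_adds_pu8(pixels, bias);
        _mm_empty();

        for (int i = 0; i < 8; ++i)
            printf("%u ", out.b[i]);
        printf("\n");
        return 0;
    }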

6
Matching the Algorithms to MMX Instruction
Capabilities
  • Usually, multiplications are the expensive operations
    in an implementation. In MMX, however, packed
    multiplications are cheap, and an algorithm with
    more multiplications can be better than another
    (see the PMADDWD sketch below).
  • MMX offers its best support for the 8-bit and 16-bit
    integer data types. Some signal-processing
    applications are harder to map because of their
    higher-precision requirements.
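As a concrete illustration of cheap packed multiplication, PMADDWD performs four 16x16-bit multiplies plus two 32-bit additions in a single instruction; a minimal sketch with <mmintrin.h> (MMX support assumed):

    #include <mmintrin.h>
    #include <stdio.h>

    int main(void) {
        /* PMADDWD: (a0*b0 + a1*b1) and (a2*b2 + a3*b3) as two 32-bit sums. */
        __m64 a = _mm_set_pi16(4, 3, 2, 1);   /* words, low to high: 1 2 3 4 */
        __m64 b = _mm_set_pi16(8, 7, 6, 5);   /* words, low to high: 5 6 7 8 */
        union { __m64 v; int d[2]; } out;

        out.v = _mm_madd_pi16(a, b);          /* 4 multiplies + 2 adds, one op */
        _mm_empty();

        printf("%d %d\n", out.d[0], out.d[1]);   /* 1*5+2*6 = 17, 3*7+4*8 = 53 */
        return 0;
    }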

7
MMX Instruction-level Code Optimization
  • Do not intermix MMX instructions and
    floating-point instructions; they do not mix well
    (a sketch of switching between the two follows below).
  • MMX instructions that reference memory or
    integer registers do not mix well with integer
    instructions referencing the same memory or
    registers.
  • Arrange data properly. Columnwise processing is
    better than sequential rowwise processing.
  • For the Pentium III or earlier Intel architectures:
  • MMX shift/pack/unpack instructions do not mix
    well with each other. Only one MMX
    shift/pack/unpack instruction can be executed per
    clock because there is only one shifter unit.
  • MMX multiplications (PMULLW/PMULHW/PMADDWD) do not
    mix well with each other.
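The usual way to keep MMX and x87 floating-point code apart is to end every MMX section with EMMS before any floating-point instruction runs; a minimal sketch (the helper function is ours, <mmintrin.h> assumed):

    #include <mmintrin.h>
    #include <math.h>
    #include <stdio.h>

    /* Illustrative helper: an MMX-only section that ends with EMMS. */
    static int packed_sum16(void) {
        __m64 a = _mm_set_pi16(1, 2, 3, 4);
        __m64 s = _mm_madd_pi16(a, _mm_set1_pi16(1));   /* sum pairs of words */
        union { __m64 v; int d[2]; } out;
        out.v = s;
        _mm_empty();       /* EMMS: release the FP registers aliased by MMX */
        return out.d[0] + out.d[1];
    }

    int main(void) {
        int s = packed_sum16();        /* MMX work, then EMMS            */
        double r = sqrt((double)s);    /* floating point only after EMMS */
        printf("%d %f\n", s, r);
        return 0;
    }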

8
Code Optimization Guidelines
  • Understand where the application spends most of
    its execution time.
  • Understand which algorithm is best for MMX
    technology in this application.
  • Understand where data values in the application can
    be converted to integers while maintaining the
    required range and precision.
  • Use a current-generation compiler that will
    produce an optimized application.
  • VC6, Delphi 6, BCB6, the Intel compiler, PGCC, etc.

9
Code Optimization Guidelines (Cont.)
  • Maximize memory-access performance:
  • Minimize memory use; maximize register usage.
  • Prefetch data and make sure all data are aligned.
  • Align frequently executed branch targets on
    16-byte boundaries.
  • Avoid partial register stalls.
  • Load and store data to the same area of memory
    using the same data sizes and address alignments.
  • Minimize branching penalties:
  • Minimize branch instructions, for instance by
    unrolling small loops.
  • Arrange code to minimize mispredictions by the
    branch-prediction algorithm.
  • Use software pipelining to schedule latencies and
    functional units.

10
Vector-Matrix Multiplication
  • C Code (original)
    int16 vect[Y_SIZE];
    int16 matr[Y_SIZE][X_SIZE];
    int16 result[X_SIZE];
    int32 accum;
    for (i = 0; i < X_SIZE; i++) {
      accum = 0;
      for (j = 0; j < Y_SIZE; j++)
        accum += vect[j] * matr[j][i];
      result[i] = accum;
    }

11
Matching Data Structure to MMX Instructions
  • With the MMX data formats we can handle a 1x2 slice of
    the vector times a 2x4 block of the matrix at a time
    (a scalar sketch of this block product follows below).
  • Hence the goal is to partition the matrix into
    2x4 blocks.
  • The L1 cache line is 256 bits wide, so one wants to
    use most of the cached data to avoid reloading the
    same data multiple times.
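To make the partitioning concrete, here is a plain-C sketch of the block product that one MULT4x2 step computes (a 1x2 slice of the vector times a 2x4 block of the matrix, accumulated into four 32-bit sums); the function and variable names are ours:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative scalar version of one MULT4x2 step: rows j and j+1 of the
       matrix, columns i..i+3, multiplied by vect[j], vect[j+1] and accumulated. */
    static void mult4x2_scalar(const int16_t *vect, int j,
                               const int16_t *matr, int x_size, int i,
                               int32_t accum[4]) {
        for (int c = 0; c < 4; ++c) {
            accum[c] += (int32_t)vect[j]     * matr[ j      * x_size + i + c]
                      + (int32_t)vect[j + 1] * matr[(j + 1) * x_size + i + c];
        }
    }

    int main(void) {
        int16_t vect[2]    = {1, 2};
        int16_t matr[2][4] = {{1, 2, 3, 4},    /* row j   */
                              {5, 6, 7, 8}};   /* row j+1 */
        int32_t accum[4]   = {0, 0, 0, 0};
        mult4x2_scalar(vect, 0, &matr[0][0], 4, 0, accum);
        printf("%d %d %d %d\n", accum[0], accum[1], accum[2], accum[3]); /* 11 14 17 20 */
        return 0;
    }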

12
Modified C Code
    int16 vect[Y_SIZE];
    int16 matr[Y_SIZE][X_SIZE];
    int16 result[X_SIZE];
    int32 accum[4];
    for (i = 0; i < X_SIZE; i += 4) {
      accum = {0, 0, 0, 0};
      for (j = 0; j < Y_SIZE; j += 2)
        accum += MULT4x2(vect[j], matr[j][i]);
      result[i..i+3] = accum;
    }
  • The C code is modified to the MMX format, so the
    accumulator is now a 1x4 array.

13
Notation and Memory Map
  • Each element is a 2-byte word, occupying 2
    addresses.
  • The first element of the 1x2 vector, v1, is pointed
    to by register esi.
  • The first element of the first row is pointed to by
    edx.
  • The number of columns is stored in ecx. Hence a21's
    address is edx + 2*ecx (a worked example follows below).
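A worked example of that address arithmetic, assuming a row-major int16 layout (the variable names mirror the registers and are illustrative only):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int16_t matr[2][4] = {{11, 12, 13, 14},   /* a11 a12 a13 a14 */
                              {21, 22, 23, 24}};  /* a21 a22 a23 a24 */
        const char *edx = (const char *)&matr[0][0]; /* base: first element of row 1 */
        long ecx = 4;                                /* number of columns            */

        /* a21 lives one full row past the base: 2*ecx bytes = one row of words. */
        const int16_t *a21 = (const int16_t *)(edx + 2 * ecx);
        printf("a21 = %d\n", *a21);   /* prints 21 */
        return 0;
    }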

14
MULT4x2 MMX Loop Body
    movd      mm7, [esi]           ; load two elements from the input vector
    punpckldq mm7, mm7             ; duplicate the input vector: v1|v2|v1|v2
    movq      mm6, [edx]           ; load first line of matrix (4 elements)
    movq      mm0, [edx + 2*ecx]   ; load second line of matrix (4 elements)
    movq      mm1, mm0             ; transpose matrix to column presentation
    punpcklwd mm0, mm6             ; mm0 keeps columns 0 and 1
    punpckhwd mm1, mm6             ; mm1 keeps columns 2 and 3
    pmaddwd   mm0, mm7             ; multiply and add the 1st and 2nd columns
    pmaddwd   mm1, mm7             ; multiply and add the 3rd and 4th columns
    paddd     mm2, mm0             ; accumulate 32-bit results for columns 0/1
    paddd     mm3, mm1             ; accumulate 32-bit results for columns 2/3
    packssdw  mm2, mm2             ; pack the results for columns 0 and 1 to 16 bits
    packssdw  mm3, mm3             ; pack the results for columns 2 and 3 to 16 bits
    punpckldq mm2, mm3             ; all four 16-bit results in one register (mm2)
    movq      [edi], mm2           ; store the four results to the output
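For readers more comfortable with C, roughly the same 4x2 step can be written with MMX intrinsics. The sketch below is ours, not a literal transcription of the assembly above: the operand order is chosen so that the scalar result accum[c] += vect[j]*row0[c] + vect[j+1]*row1[c] is reproduced, and one PACKSSDW suffices because the intrinsic form packs both accumulators at once.

    #include <mmintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    /* One 4x2 step: (v1, v2) times a 2x4 matrix block, accumulated into
       two pairs of 32-bit sums (acc01 for columns 0/1, acc23 for 2/3). */
    static void mult4x2_intrin(__m64 v12, __m64 r0, __m64 r1,
                               __m64 *acc01, __m64 *acc23) {
        __m64 v   = _mm_unpacklo_pi32(v12, v12);              /* PUNPCKLDQ: v1|v2|v1|v2 */
        __m64 c01 = _mm_unpacklo_pi16(r0, r1);                /* PUNPCKLWD: columns 0,1 */
        __m64 c23 = _mm_unpackhi_pi16(r0, r1);                /* PUNPCKHWD: columns 2,3 */
        *acc01 = _mm_add_pi32(*acc01, _mm_madd_pi16(c01, v)); /* PMADDWD + PADDD        */
        *acc23 = _mm_add_pi32(*acc23, _mm_madd_pi16(c23, v));
    }

    int main(void) {
        __m64 v12 = _mm_set_pi16(0, 0, 2, 1);       /* v1 = 1, v2 = 2 in the low words */
        __m64 r0  = _mm_set_pi16(4, 3, 2, 1);       /* row j:   1 2 3 4                */
        __m64 r1  = _mm_set_pi16(8, 7, 6, 5);       /* row j+1: 5 6 7 8                */
        __m64 acc01 = _mm_setzero_si64(), acc23 = _mm_setzero_si64();
        union { __m64 v; int16_t s[4]; } result;

        mult4x2_intrin(v12, r0, r1, &acc01, &acc23);
        result.v = _mm_packs_pi32(acc01, acc23);    /* PACKSSDW: back to 16-bit words  */
        _mm_empty();
        printf("%d %d %d %d\n", result.s[0], result.s[1], result.s[2], result.s[3]);
        /* expected: 11 14 17 20, matching the scalar sketch earlier */
        return 0;
    }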

15
Trace the program
Register contents after the execution of each instruction:
  • MOVD mm7, [esi]            mm7 = XX | XX | v1 | v2
  • PUNPCKLDQ mm7, mm7         mm7 = v1 | v2 | v1 | v2
    (Note: DQ means v1 and v2 are moved together as one doubleword.)
  • MOVQ mm6, [edx]            mm6 = a11 | a12 | a13 | a14
  • MOVQ mm0, [edx + 2*ecx]    mm0 = a21 | a22 | a23 | a24
  • MOVQ mm1, mm0              mm1 = a21 | a22 | a23 | a24
  • PUNPCKLWD mm0, mm6         mm0 = a13 | a23 | a14 | a24
16
Trace
  • PUNPCKHWD mm1, mm6         mm1 = a11 | a21 | a12 | a22
  • PMADDWD mm0, mm7           mm0 = v1*a13 + v2*a23 | v1*a14 + v2*a24
    (Overflow may occur here, but only wrap-around is available.)
  • PMADDWD mm1, mm7           mm1 = v1*a11 + v2*a21 | v1*a12 + v2*a22
    (mm7 still holds v1 | v2 | v1 | v2.)
17
Trace
  • PADDD mm2, mm0             mm2 = v1*a13 + v2*a23 | v1*a14 + v2*a24
    (mm2 and mm3 are accumulators, so the results can be
    reused later; wrap-around overflow control.)
  • PADDD mm3, mm1             mm3 = v1*a11 + v2*a21 | v1*a12 + v2*a22
  • PACKSSDW mm2, mm2          mm2 = b3 | b4 | b3 | b4
  • PACKSSDW mm3, mm3          mm3 = b1 | b2 | b1 | b2
  • PUNPCKLDQ mm2, mm3         mm2 = b1 | b2 | b3 | b4
  • MOVQ [edi], mm2            store the four 16-bit results to the output
18
Loop Unrolling
    int16 vect[Y_SIZE];
    int16 matr[Y_SIZE][X_SIZE];
    int16 result[X_SIZE];
    int32 accum[16];
    for (i = 0; i < X_SIZE; i += 16) {
      accum = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
      for (j = 0; j < Y_SIZE; j += 2) {
        accum[0..3]   += MULT4x2(vect[j], matr[j][i]);
        accum[4..7]   += MULT4x2(vect[j], matr[j][i+4]);
        accum[8..11]  += MULT4x2(vect[j], matr[j][i+8]);
        accum[12..15] += MULT4x2(vect[j], matr[j][i+12]);
      }
      result[i..i+15] = accum;
    }

19
Register Assignments
  • The four instances of the MULT4x2 routine in the
    unrolled code will be assigned to different
    registers to hold the temporary results.

20
Instruction Scheduling
21
Useful URLs (may need to be updated)
  • http://www.intel.com/design/mmx/manuals/
  • Intel MMX manuals, application notes, etc.
  • http://support.intel.com/support/performancetools/
  • Intel Performance Tools.
  • http://support.intel.com/support/performancetools/libraries/
  • Intel Performance Library Suite.
  • http://coolgraphix.homepage.com/
  • Software with JPEG compression using MMX by TY.
  • http://www.zdnet.com/pcmag/features/mmx/mmx-s4.htm

22
References
  • Intel Corp., David Bistry et al., The Complete
    Guide to MMX Technology, McGraw-Hill, Inc., 1997.