MMX Architecture Programming and Performance Optimization

1
MMX Architecture Programming and Performance Optimization
  • The MMX Instruction Set
  • MMX Technology Optimization Techniques

2
MMX Instruction Overview
  • 57 new opcodes introduced
  • Instructions are grouped into the following
    categories:
  • packed arithmetic (e.g. padd, psub)
  • conversions (e.g. pack, unpack)
  • logical operations (e.g. pand, por, pxor)
  • data transfer operations (e.g. movd, movq)
  • EMMS (Empty Multimedia State)

3
Packed Arithmetic Example 1
  • paddsw MM2, MM4
  • (Packed Add with Saturation for Word)
  • p: Packed
  • add: the instruction
  • s: Saturation
  • w: Word

4
Packed Arithmetic Example 1
  • paddsw MM2, MM4

[Figure: paddsw example; two of the result words are marked "Saturated"]
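The saturating behavior of paddsw can be modeled in scalar C. This is only a sketch of the semantics, not MMX code: the function names are hypothetical, and each of the four word lanes is handled by an ordinary loop iteration.

```c
#include <stdint.h>

/* Add two signed 16-bit words, clamping to [-32768, 32767] instead of wrapping. */
int16_t add_sat_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* widen so the sum cannot overflow */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Scalar model of "paddsw dst, src": four independent word lanes. */
void paddsw_model(int16_t dst[4], const int16_t src[4])
{
    for (int lane = 0; lane < 4; lane++)
        dst[lane] = add_sat_s16(dst[lane], src[lane]);
}
```

A lane holding 32767 plus 1 stays at 32767, whereas a wraparound add would produce -32768.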
5
Packed Arithmetic Example 2
  • pmaddwd MM2, MM4
  • (Packed Multiply and Add)
  • Multiply packed words in parallel
  • Add the 32-bit results pairwise
  • Store in MMX register as dwords

6
Packed Arithmetic Example 2
  • pmaddwd MM2, MM4

Wraps around only when all four source words feeding a
dword result are 8000h (the result is then 80000000h)
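The multiply-and-add semantics, including the single wraparound case, can be modeled in scalar C. This is a sketch under assumptions: the function name is hypothetical, and the final conversion back to a signed value relies on the two's-complement behavior that mainstream compilers provide.

```c
#include <stdint.h>

/* Scalar model of "pmaddwd dst, src".
 * a, b: four signed 16-bit words each; out: two signed 32-bit results. */
void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    for (int pair = 0; pair < 2; pair++) {
        int32_t lo = (int32_t)a[2 * pair]     * b[2 * pair];
        int32_t hi = (int32_t)a[2 * pair + 1] * b[2 * pair + 1];
        /* Add as unsigned so the wraparound is well defined in C;
         * the cast back to int32_t assumes two's complement. */
        out[pair] = (int32_t)((uint32_t)lo + (uint32_t)hi);
    }
}
```

With all four words equal to 8000h (-32768), each product is 40000000h and the sum wraps to 80000000h.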
7
Using Saturating Arithmetic
  • Example: absolute difference for 8-bit unsigned
    data

[Figure: absolute differences]
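The figure for this slide is not preserved. One common way to get an unsigned absolute difference from saturating arithmetic, assumed here rather than taken from the slide, is to subtract in both directions with unsigned saturation and OR the results (psubusb, psubusb, por in MMX terms). A scalar C sketch with hypothetical function names:

```c
#include <stdint.h>

/* Unsigned saturating subtract: a result below 0 clamps to 0. */
uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* |a - b| for eight byte lanes. In each lane one of the two saturated
 * differences is 0, so OR-ing them leaves the absolute difference. */
void absdiff_u8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    for (int lane = 0; lane < 8; lane++)
        out[lane] = (uint8_t)(sub_sat_u8(a[lane], b[lane]) |
                              sub_sat_u8(b[lane], a[lane]));
}
```

No comparison or branch on the data is needed in the MMX version, which is what makes saturation useful here.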
8
Optimization Techniques
  • MMX instructions utilize the U and V pipes of the
    Pentium processor
  • To write optimized MMX code, one must:
  • understand the MMX instruction latencies
  • know how to pair MMX instructions
  • learn how to efficiently mix MMX and regular
    integer instructions
  • take cache structure into consideration

9
MMX Instruction Latencies
  • Latency rule 1: After modifying an MMX register,
    wait until the next clock (at least) before
    reading the same register (avoiding RAW hazards)
  • Example:
  • movq mm0, [eax]    ; U pipe, clock 1
  • movq mm3, mm2      ; V pipe, clock 1
  • paddw mm0, mm1     ; U pipe, clock 2
  • movq mm2, mm0      ; STALL, clock 3

10
MMX Instruction Latencies
  • Example (no stall!):
  • movq mm0, [eax]    ; V pipe
  • movq mm3, mm2      ; U pipe
  • paddw mm0, mm1     ; V pipe
  • movq mm2, mm0      ; U pipe
  • To avoid such stalls, determine how the sequence
    of instructions will line up relative to the U
    and V pipes
  • A branch target is always executed in the U pipe

11
MMX Instruction Latencies
  • Latency rule 2: After issuing a multiply
    instruction (pmaddwd, pmulhw, pmullw), wait until
    three clocks later before using the result
  • Example:
  • pmaddwd mm1, [esi+4*ecx]   ; U pipe
  • movq mm0, mm2              ; V pipe
  • xxx                        ; don't use mm1
  • xxx                        ; still don't
  • xxx                        ; still don't
  • xxx                        ; still don't
  • paddd mm7, mm1             ; now you can!

12
MMX Instruction Latencies
  • Latency rule 3: After modifying an MMX register,
    wait until two clocks later before storing the
    result to either memory or an integer
    register (EAX, EBX, and so on)
  • Example:
  • psubw mm0, mm1     ; mm0 gets modified
  • xxx                ; don't store yet
  • xxx                ; still don't
  • xxx                ; still don't
  • movq [edi], mm0    ; OK to store

13
Pairing MMX Instructions
  • Goal: to achieve a maximum throughput of two
    instructions per processor clock
  • Follow four basic MMX instruction pairing
    rules on the Pentium processor

14
Pairing MMX Instructions
  • Pairing rule 1: In each clock, at most one MMX
    multiplication instruction (pmaddwd, pmulhw, or
    pmullw) can be executed.
  • Example:
  • pmaddwd mm0, mm1   ; a multiply
  • pmulhw mm2, mm3    ; another multiply, will not pair
  • The two instructions take 2 clocks to execute (or
    more if there are stalls due to the latency
    rules!)

15
Pairing MMX Instructions
  • Pairing rule 2: In each clock, at most one MMX
    shift or pack or unpack instruction can be
    executed.
  • Example: the following combinations will not pair
  • 1. psrad / psllw
  • 2. psrad / packssdw
  • 3. packssdw / punpckwd
16
Pairing MMX Instructions
  • Pairing rule 3: In each clock, the U/V
    instruction pair can contain at most one memory
    or integer register (EAX, EBX, etc.) reference,
    and if it contains one, it must be executed in
    the U pipe.
  • Example: the following combinations will not pair
  • 1. movq mm0, [esi] / movd eax, mm1
  • 2. paddd mm0, mm1 (U pipe) / movq mm2, [esi]
17
Pairing MMX Instructions
  • Pairing rule 4: For optimal pairing, avoid
    instructions that are more than 7 bytes long.
    Such instructions must be executed in the U pipe
    and usually will not pair (e.g. an instruction
    with a memory operand containing a base
    register, an index, and a 32-bit displacement is
    8 bytes or longer).
  • Example:
  • paddusb mm3, [esi+4*ecx+1024]   ; 8 bytes long
  • paddusb mm3, [esi+1024]         ; index removed
  • Or:
  • paddusb mm3, [esi+4*ecx+64]     ; offset shortened
    to 8 bits (the offset can be 0, 8, or 32 bits
    only)

18
Mixing Integer and MMX Instructions
  • An integer and an MMX instruction will pair if:
  • a. the integer instruction is a pairable
    instruction for the pipe where it is being
    executed
  • b. the MMX instruction does not reference memory
    or an integer register
  • Example:
  • pmulhw mm0, mm1   ; no mem/int-reg reference
  • add eax, 4        ; V-pairable
  • add esi, edi      ; U-pairable
  • padd mm2, mm3     ; no mem/int-reg reference

19
Mixing Floating-Point and MMX Instructions
  • Each transition between MMX code and
    floating-point code costs about 50 clocks
  • Do not mix floating-point and MMX instructions at
    the instruction level
  • It is reasonable to mix floating-point and MMX
    instructions at the module (function) level
  • Use the EMMS instruction at the end of every MMX
    code sequence. If you don't:
  • incorrect floating-point results may be produced
  • floating-point exceptions may be generated, which
    degrades performance

20
Software Pipelining
  • The dot product of x and y is
    x · y = x[0]*y[0] + x[1]*y[1] + ... + x[n-1]*y[n-1]
  • x and y, two real vectors, can be conveniently
    represented for MMX technology programming as
    arrays of 16-bit integers of some length n

21
Software Pipelining
  • Dot product in C code:
  • dotprod = 0;
  • for (i = 0; i < n; i++)
  •     dotprod += x[i] * y[i];
  • MMX implementation for the inner loop:
  • ; esi points to x array, edi points to y
  • ; ecx is loop counter
  • dotprod:
  • movq mm0, [esi+8*ecx]      ; load x[i]
  • pmaddwd mm0, [edi+8*ecx]   ; multiply by y[i]
  • paddd mm1, mm0             ; accumulate to mm1
  • dec ecx
  • jge dotprod
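What the MMX inner loop computes can be modeled in scalar C next to the reference loop. This is a sketch under assumptions: the function names are hypothetical, n is taken to be a multiple of 4, the sums are assumed not to overflow 32 bits, and the model walks the arrays upward even though the assembly counts ecx down.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference: straightforward scalar dot product. */
int32_t dot_scalar(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (int32_t)x[i] * y[i];
    return sum;
}

/* Model of the MMX loop: each iteration consumes one quadword (4 words)
 * from x and y. acc[0] and acc[1] stand for the two dword halves of mm1. */
int32_t dot_mmx_model(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t acc[2] = {0, 0};
    for (size_t i = 0; i + 4 <= n; i += 4) {
        /* pmaddwd: multiply four word pairs, add adjacent products */
        acc[0] += (int32_t)x[i]     * y[i]     + (int32_t)x[i + 1] * y[i + 1];
        acc[1] += (int32_t)x[i + 2] * y[i + 2] + (int32_t)x[i + 3] * y[i + 3];
    }
    /* After the loop the two partial sums still have to be combined. */
    return acc[0] + acc[1];
}
```

The final addition of the two halves is not part of the inner loop shown on the slide; it would follow the loop once.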

22
Software Pipelining
  • Analyzing the code using the latency and pairing
    rules
  • Code as executed on the Pentium processor:
  • dotprod:                    pipe  clock
  • movq mm0, [esi+8*ecx]       U     1
  • (V-pipe stall)              V
  • pmaddwd mm0, [edi+8*ecx]    U     2
  • (V-pipe stall)              V
  • (U-pipe stall)              U     3
  • (V-pipe stall)              V
  • (U-pipe stall)              U     4
  • (V-pipe stall)              V           ; multiply now done
  • paddd mm1, mm0              U     5
  • dec ecx                     V           ; note int/MMX pairing
  • jge dotprod                 U     6
  • (V-pipe stall)              V
  • It takes six clocks to process four elements per
    iteration, i.e., 1.5 clocks per array element

23
Software Pipelining
  • Optimization: use additional iterations of the
    same loop to fill the empty slots
  • Abstract each instruction into a symbol -Xi,
    where X can be:
  • M (multiply)
  • S (shift, pack, unpack)
  • A (anything else)
  • The leading dash indicates a memory or integer
    register operand
  • The subscript i distinguishes different
    instructions
  • The programming task is reduced to the problem of
    filling a 2-D table with letters
24
Software Pipelining
  • Symbolizing the three instructions in the dot
    product code:
  • -L: movq
  • -M: pmaddwd
  • A: paddd
  • Each M falls one clock (or more) after the
    corresponding L, and each A falls three clocks
    (or more) after the corresponding M
  • Each U/V pair contains at most one M and at most
    one - (and the -, if present, must be in the U
    pipe)

25
Software Pipelining
  • Interleaving three iterations of the loop
    (dependencies are shown with arrows; interleave
    factor k = 3)

26
Software Pipelining
  • Adding loop control
  • DEC and JGE can pair, but one extra clock is
    still needed for the decremented ECX to be
    available for the next memory reference.
  • Including the loop control, it takes 8 clocks to
    process 12 (3 x 4) array elements, i.e., 0.75
    clocks per array element.
  • Speedup: 2 (excluding loop prologue and epilogue)

27
Software Pipelining
  • The optimized code structure
  • I = n/4 - 3        ; start with last quadword
    triplet
  • -L(I+1)            ; loop prologue
  • -M(I+1)
  • -L(I+2)
  • -M(I+2)
  • --------------------------------------------------
  • loop_top:          ; loop begins here
  • -L(I)
  • A
  • -M(I)
  • stall(V)
  • -L(I-2)            ; note change in index (decr. by
    3)
  • A
28
Software Pipelining
  • The optimized code structure(continued)
  • -M(I-2)
  • stall(V)
  • -L(I-1)            ; note change in index
  • A
  • -M(I-1)
  • stall(V)
  • I = I - 3          ; decrement I by 3
  • jg loop_top
  • --------------------------------------------------
  • -L(0)              ; epilogue
  • A                  ; need to do A for I = 1, 2
  • -M(0)              ; and the whole (L, M, A)
  • A                  ; for I = 0
  • A

29
Cache Considerations
  • Optimization for the cache becomes even more
    important because of the SIMD properties of MMX
    instructions
  • To optimize the way the program uses the data
    cache, rearrange the way the data is laid out in
    memory using these techniques:
  • data alignment
  • separate vs. compound arrays
  • rearranging data structures
  • padding and alignment

30
Cache Considerations
  • Restructure the way the code accesses data:
  • loop interchange
  • loop fusion
  • blocking
  • To optimize instruction cache utilization,
    restructure the code to reduce code size at the
    assembly level

31
Data Alignment
  • The Pentium processor can access data at any byte
    boundary
  • A misaligned access costs 3 extra cycles
  • To avoid the misalignment penalty, align data
    objects according to their size:
  • align 2-byte data so that it doesn't cross a
    4-byte boundary
  • align 2-byte data on a 2-byte boundary
  • align 8-byte MMX data on an 8-byte boundary
  • Example: aligning a C data structure (clip)
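The C example from this slide is not preserved. As a stand-in, here is a minimal sketch using C11 alignas; the structure, its field names, and the helper function are hypothetical and only illustrate the alignment rules listed above.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure: the quadword that MMX code would load comes
 * first and is forced onto an 8-byte boundary. */
struct pixel_block {
    alignas(8) int16_t coeff[4];   /* 8 bytes of MMX data */
    int16_t gain;                  /* 2-byte field on a 2-byte boundary */
    uint8_t flags;
};

/* Nonzero when p sits on an align-byte boundary (align must be a power of two). */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

Because the structure's own alignment becomes 8, the compiler also pads its size to a multiple of 8, so every element of an array of pixel_block keeps coeff aligned.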

32
Data Alignment
  • If the amount of computation >> the amount of
    input data, then duplicate the input data with
    different alignments, so that any segment of the
    original data can be read by reading one of the
    aligned duplicate arrays.
  • If the amount of input data is too large, then
    align it on the fly.
  • For example, to read 8 misaligned bytes, read the
    2 aligned quadwords on either side, then shift
    and OR them together.
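The on-the-fly alignment described above can be sketched in C with 64-bit integers standing in for MMX registers (the MMX version would use psrlq, psllq, and por). The function name is hypothetical, and the sketch assumes little-endian byte order, as on x86, and that the quadword following the one containing the offset is readable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read 8 bytes starting at byte offset off within an 8-byte-aligned
 * buffer, using only aligned quadword loads. */
uint64_t load_misaligned(const uint64_t *aligned_base, size_t off)
{
    size_t   q     = off / 8;                  /* aligned quadword below */
    unsigned shift = (unsigned)(off % 8) * 8;  /* misalignment in bits */
    uint64_t lo    = aligned_base[q];

    if (shift == 0)
        return lo;              /* already aligned; also avoids a shift by 64 */

    uint64_t hi = aligned_base[q + 1];         /* aligned quadword above */
    return (lo >> shift) | (hi << (64 - shift));
}
```

Two aligned loads, two shifts, and one OR replace a single misaligned load, which pays off only when the misalignment penalty is larger than those extra operations.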

33
Separate vs. Compound Array
  • First look at how the code accesses the array
    elements, then declare the structure in the same
    way
  • Example
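The example from this slide is not preserved. The sketch below shows the two layouts with hypothetical names and an illustrative size; the access pattern in sum_products always uses x and y of the same element together, which is the case where the compound layout matches the code.

```c
#include <stddef.h>

#define NPOINTS 1024   /* illustrative size */

/* Separate arrays: suits loops that touch only one field at a time. */
struct separate {
    short x[NPOINTS];
    short y[NPOINTS];
};

/* Compound array: x and y of one element are adjacent in memory,
 * so they normally land in the same cache line. */
struct point { short x, y; };
struct compound {
    struct point p[NPOINTS];
};

/* Access pattern that favors the compound layout. */
long sum_products(const struct compound *c)
{
    long sum = 0;
    for (size_t i = 0; i < NPOINTS; i++)
        sum += (long)c->p[i].x * c->p[i].y;
    return sum;
}
```

If the code instead scanned all x values and never touched y, the separate layout would waste less cache space.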

34
Rearranging Data Structure
  • Put data elements that are accessed in parallel
    together
  • Put frequently used data elements together
  • When rearranging, be careful not to misalign the
    data

35
Padding and Aligning Arrays
  • When accessing an array of data structures
    randomly:
  • padding and aligning makes each structure span
    the minimum number of cache lines necessary
  • for each structure, one cache miss is avoided
  • Cost: padding enlarges the data structure, so
    less information is stored in the cache
  • Cost: potential for capacity misses
36
Loop Interchange
  • Structure the code so that it accesses array
    elements consecutively within each cache line
  • C arrays are stored row-wise; Fortran arrays are
    stored column-wise
  • Spatial locality: all the elements in a cache
    line are used before the line is replaced
37
Loop Fusion
  • Combine multiple loops over the same array into
    one single loop
  • Increases temporal locality
  • Often reduces capacity misses

[Figure: separate loops vs. fused loops]
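Since the figure is not preserved, here is a minimal C sketch of the idea with hypothetical function names. Both functions leave the array in the same state.

```c
#include <stddef.h>

/* Separate loops: the array is streamed through the cache twice. */
void scale_then_offset(int *a, size_t n, int scale, int offset)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= scale;
    for (size_t i = 0; i < n; i++)
        a[i] += offset;
}

/* Fused loop: each element is loaded once, fully updated, and stored once. */
void scale_and_offset_fused(int *a, size_t n, int scale, int offset)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * scale + offset;
}
```

When the array is larger than the cache, the second loop of the unfused version finds the early elements already evicted; fusing removes that second pass.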

38
Blocking
  • Restructure the program so that it uses smaller
    blocks of data
  • Blocking increases the temporal locality of the
    code
  • Useful when multiplying large matrices that
    cannot fit into the cache at the same time
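A minimal C sketch of blocked matrix multiplication, not the code from the slides. The function names, the matrix dimension, and the block size are illustrative assumptions, and the block size must divide the dimension in this sketch. Integer elements are used so both versions give exactly the same result.

```c
#define DIM 32      /* matrix dimension, illustrative */
#define BLOCK 8     /* block edge; must divide DIM in this sketch */

/* Straightforward triple loop: for large matrices, data from b is
 * evicted from the cache before it is reused. */
void matmul_plain(int a[DIM][DIM], int b[DIM][DIM], int c[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            int sum = 0;
            for (int k = 0; k < DIM; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* Blocked version: work on BLOCK x BLOCK tiles so that each tile is
 * reused many times while it is still in the cache. */
void matmul_blocked(int a[DIM][DIM], int b[DIM][DIM], int c[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            c[i][j] = 0;

    for (int ii = 0; ii < DIM; ii += BLOCK)
        for (int kk = 0; kk < DIM; kk += BLOCK)
            for (int jj = 0; jj < DIM; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++)
                        for (int j = jj; j < jj + BLOCK; j++)
                            c[i][j] += a[i][k] * b[k][j];
}
```

A suitable block size depends on the cache size and the element size; the value here is only a placeholder.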

39
Blocking
[Figure: blocked code vs. original code]
  • fewer iterations mean fewer misses
  • 4 elements from each line of A are used before
    the line is replaced
  • hit rate 96.68% → 97.07% (1.38 improvement)
  • a reduction of 1 million cache misses, saving at
    least 3 million clocks

40
Reducing Code Size(Assembly Code)
  • Keep code size within 8 KB, the size of the
    instruction cache
  • Replace a sequence of single-cycle instructions
    with a single multi-cycle instruction
  • Pull address calculation into load/store
    instructions
  • Use shorter opcodes
  • Eliminate compares with immediate zeros
  • Shorten instructions by using the Pentium
    processor's eax register

41
MMX Optimization Summary
  • Make data structures and memory accesses 8-byte
    aligned
  • Structure program and data to maximize
    instruction and data cache hits
  • For each function in the program, craft a minimal
    sequence of MMX code with SIMD-like thinking.
  • Use software pipelining and loop unrolling
