MMX-accelerated Matrix Multiplication

About This Presentation
Title: MMX-accelerated Matrix Multiplication

Description: Title: Traditional Matrix Multiply Algorithm; Author: balasai; Last modified by: Wendy Tsai; Created Date: 10/10/2006 6:08:09 AM; Document presentation format: PowerPoint PPT presentation

Slides: 30
Provided by: bala80

Transcript and Presenter's Notes



1
MMX-accelerated Matrix Multiplication
  • Assembly Language System Software
  • National Chiao-Tung Univ.

2
Motivation
  • Pentium processors support SIMD instructions for
    vector operations
  • Multiple operations can be performed in parallel
  • In this lecture, we shall show how to accelerate
    matrix multiplication by using MMX instructions

3
Naïve Matrix Multiplication
4
Naïve Matrix Multiplication
  • int16 vect[Y_SIZE];
  • int16 matr[Y_SIZE][X_SIZE];
  • int16 result[X_SIZE];
  • int32 accum;
  • for (i = 0; i < X_SIZE; i++) {
  •   accum = 0;
  •   for (j = 0; j < Y_SIZE; j++)
  •     accum += vect[j] * matr[j][i];
  •   result[i] = accum;
  • }
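
The loop above can be written as a compilable C function. This is a minimal sketch, not the lecture's own code: the name `matvec_naive` and the flat row-major layout (`matr[j * x_size + i]`) are assumptions made for illustration, and the final narrowing assumes each sum fits in 16 bits.

```c
#include <stdint.h>
#include <stddef.h>

/* result[i] = sum over j of vect[j] * matr[j][i].
   matr is assumed to be stored line by line: matr[j * x_size + i]. */
void matvec_naive(const int16_t *vect, const int16_t *matr,
                  int16_t *result, size_t x_size, size_t y_size)
{
    for (size_t i = 0; i < x_size; i++) {
        int32_t accum = 0;                      /* 32-bit accumulator */
        for (size_t j = 0; j < y_size; j++)
            accum += (int32_t)vect[j] * matr[j * x_size + i];
        result[i] = (int16_t)accum;             /* assumes the sum fits in 16 bits */
    }
}
```

Note that the inner loop walks down a column of `matr`, so consecutive accesses are `x_size` elements apart in memory.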

5
MMX
  • A collection of
  • new SIMD instructions
  • new registers
  • mm0 to mm7, each 64 bits wide
  • MMX is primarily for integer vector operations

6
MMX™ registers
[Figure: layout of an MMX register. Recoverable labels: the 64-bit mmx register shown inside an 80-bit float register; an 8-bit char a; a 32-bit int b split into b1 to b4; packed 16-bit elements filling 64-bit registers.]

7
MMX instructions
  • movd, movq: Move Doubleword, Move Quadword
  • punpcklbw, punpcklwd, punpckldq: Unpack Low Data and Interleave (byte, word, doubleword)
  • punpckhwd: Unpack High Data and Interleave (word)
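
To make the interleaving concrete, here is a small portable C emulation of the two word-unpack instructions. It is only a sketch of their semantics: the `mm_words` struct and the function names are hypothetical stand-ins for a 64-bit MMX register, with `w[0]` taken as the lowest 16 bits.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit MMX register holding four 16-bit words;
   w[0] is the lowest word. */
typedef struct { int16_t w[4]; } mm_words;

/* punpcklwd dst, src: interleave the two LOW words of dst and src. */
mm_words punpcklwd(mm_words dst, mm_words src)
{
    mm_words r = {{ dst.w[0], src.w[0], dst.w[1], src.w[1] }};
    return r;
}

/* punpckhwd dst, src: interleave the two HIGH words of dst and src. */
mm_words punpckhwd(mm_words dst, mm_words src)
{
    mm_words r = {{ dst.w[2], src.w[2], dst.w[3], src.w[3] }};
    return r;
}
```

With dst = {1, 2, 3, 4} and src = {5, 6, 7, 8}, the low unpack gives {1, 5, 2, 6} and the high unpack gives {3, 7, 4, 8}, which is how two matrix lines become column pairs.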

8
MMX instructions
  • pmaddwd: Multiply and Add Packed Integers (word)
  • paddd: Add Packed Integers (doubleword)
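
The two arithmetic instructions can be modeled in plain C as follows. This is a sketch of their semantics only; the struct types and function names are hypothetical, and the wraparound on overflow relies on the usual two's-complement conversion behavior of common compilers.

```c
#include <stdint.h>

typedef struct { int16_t w[4]; } mm_words;   /* four packed 16-bit words */
typedef struct { int32_t d[2]; } mm_dwords;  /* two packed 32-bit doublewords */

/* Keep the low 32 bits of a 64-bit sum, as the hardware does. */
static int32_t wrap32(int64_t x)
{
    return (int32_t)(uint32_t)x;
}

/* pmaddwd: multiply corresponding words, then add adjacent products,
   giving two 32-bit results. */
mm_dwords pmaddwd(mm_words a, mm_words b)
{
    mm_dwords r;
    r.d[0] = wrap32((int64_t)a.w[0] * b.w[0] + (int64_t)a.w[1] * b.w[1]);
    r.d[1] = wrap32((int64_t)a.w[2] * b.w[2] + (int64_t)a.w[3] * b.w[3]);
    return r;
}

/* paddd: add the two doublewords independently, with wraparound. */
mm_dwords paddd(mm_dwords a, mm_dwords b)
{
    mm_dwords r;
    r.d[0] = wrap32((int64_t)a.d[0] + b.d[0]);
    r.d[1] = wrap32((int64_t)a.d[1] + b.d[1]);
    return r;
}
```

Each doubleword produced by `pmaddwd` is a two-term dot product, which is why the instruction suits matrix multiplication.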

9
MMX for Matrix Multiply
  • One matrix multiplication is divided into a series of multiplications of a 1×2 vector by a 2×4 sub-matrix

10
MMX for Matrix Multiply
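[Figure: esi points to the current pair of input-vector elements, edx points to the current 4x2 matrix block, and each matrix line holds ecx elements.]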
11
MMX for Matrix Multiply
  • int16 vect[Y_SIZE];
  • int16 matr[Y_SIZE][X_SIZE];
  • int16 result[X_SIZE];
  • int32 accum[4];
  • for (i = 0; i < X_SIZE; i += 4) {
  •   accum = {0, 0, 0, 0};
  •   for (j = 0; j < Y_SIZE; j += 2)
  •     accum += MULT4x2(&vect[j], &matr[j][i]);
  •   result[i..i+3] = accum;
  • }
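
The blocked loop can be tried out in plain C before writing any MMX code. The sketch below uses a scalar stand-in for MULT4x2; the names `mult4x2` and `matvec_blocked`, the flat row-major layout, and the requirement that the sizes are multiples of 4 and 2 are assumptions for illustration. It narrows the results without saturation, so it assumes each sum fits in 16 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Scalar stand-in for MULT4x2: adds a 1x2 vector times a 2x4 block
   (two lines, four columns) into four 32-bit accumulators. */
static void mult4x2(const int16_t *v, const int16_t *block,
                    size_t line_elems, int32_t accum[4])
{
    for (size_t c = 0; c < 4; c++)
        accum[c] += (int32_t)v[0] * block[c]
                  + (int32_t)v[1] * block[line_elems + c];
}

void matvec_blocked(const int16_t *vect, const int16_t *matr,
                    int16_t *result, size_t x_size, size_t y_size)
{
    /* This sketch only handles sizes that divide evenly into blocks. */
    assert(x_size % 4 == 0 && y_size % 2 == 0);

    for (size_t i = 0; i < x_size; i += 4) {
        int32_t accum[4] = {0, 0, 0, 0};
        for (size_t j = 0; j < y_size; j += 2)
            mult4x2(&vect[j], &matr[j * x_size + i], x_size, accum);
        for (size_t c = 0; c < 4; c++)
            result[i + c] = (int16_t)accum[c];  /* assumes the sum fits in 16 bits */
    }
}
```

A version like this is also a convenient reference for checking the output of the MMX routine.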

12
MMX code for MULT4x2
  • MULT4x2:
  • movd mm7, [esi] ; Load two elements from input vector
  • punpckldq mm7, mm7 ; Duplicate input vector: x0:x1:x0:x1
  • movq mm0, [edx+0] ; Load first line of matrix (4 elements)
  • movq mm6, [edx+2*ecx] ; Load second line of matrix (4 elements)
  • movq mm1, mm0 ; Transpose matrix to column presentation
  • punpcklwd mm0, mm6 ; mm0 keeps columns 0 and 1
  • punpckhwd mm1, mm6 ; mm1 keeps columns 2 and 3
  • pmaddwd mm0, mm7 ; Multiply and add the 1st and 2nd column
  • pmaddwd mm1, mm7 ; Multiply and add the 3rd and 4th column
  • paddd mm2, mm0 ; Accumulate 32-bit results for col. 0/1
  • paddd mm3, mm1 ; Accumulate 32-bit results for col. 2/3

13
MMX code for MULT4x2
  • Matrix states in multiplication
  • movd mm7, [esi] ; Load two elements from input vector
  • punpckldq mm7, mm7 ; Duplicate input vector: x0:x1:x0:x1

14
MMX code for MULT4x2
  • movq mm0, [edx+0] ; Load first line of matrix
  • The 4x2 block is addressed through register edx
  • movq mm6, [edx+2*ecx] ; Load second line of matrix
  • ecx contains the number of elements per matrix line; each element is 2 bytes, so 2*ecx is the byte offset to the next line
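
The address arithmetic behind the second load (edx plus twice ecx) can be checked with a few lines of C. This is an illustrative sketch; the helper names `line_stride_bytes` and `second_line` are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Byte distance between the starts of two consecutive matrix lines
   when each line holds elems_per_line 16-bit elements (the 2*ecx term). */
size_t line_stride_bytes(size_t elems_per_line)
{
    return sizeof(int16_t) * elems_per_line;
}

/* Address of the second line of a block whose first line starts at
   block (the role played by edx). */
const int16_t *second_line(const int16_t *block, size_t elems_per_line)
{
    return (const int16_t *)((const unsigned char *)block
                             + line_stride_bytes(elems_per_line));
}
```

For a matrix with 8 elements per line, the stride is 16 bytes, so the second line of a block starting at column 4 of line 0 starts at column 4 of line 1.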

15
MMX code for MULT4x2
  • movq mm1, mm0 ; Transpose matrix to column presentation
  • punpcklwd mm0, mm6 ; mm0 keeps columns 0 and 1
  • punpckhwd mm1, mm6 ; mm1 keeps columns 2 and 3

16
MMX code for MULT4x2
  • pmaddwd mm0, mm7 ; Multiply and add the 1st and 2nd column
  • pmaddwd mm1, mm7 ; Multiply and add the 3rd and 4th column

17
MMX code for MULT4x2
  • paddd mm2, mm0 ; Accumulate 32-bit results for col. 0/1
  • paddd mm3, mm1 ; Accumulate 32-bit results for col. 2/3

18
MMX code for MULT4x2
  • Packing and storing results
  • packssdw mm2, mm2 ; Pack the results for columns 0 and 1 to 16 bits
  • packssdw mm3, mm3 ; Pack the results for columns 2 and 3 to 16 bits
  • punpckldq mm2, mm3 ; All four 16-bit results in one register (mm2)
  • movq [edi], mm2 ; Store four results into output vector

19
MMX code for MULT4x2
  • packssdw mm2, mm2
  • packssdw mm3, mm3
  • Convert (shrink) signed DWORDs into signed WORDs, with saturation
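
What the shrinking step does to out-of-range values can be seen in a short C model of packssdw. This is a sketch of the instruction's semantics only; the struct types and the helper name `saturate_to_word` are hypothetical.

```c
#include <stdint.h>

typedef struct { int32_t d[2]; } mm_dwords;  /* two packed 32-bit doublewords */
typedef struct { int16_t w[4]; } mm_words;   /* four packed 16-bit words */

/* Clamp a signed 32-bit value into the signed 16-bit range. */
int16_t saturate_to_word(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* packssdw dst, src: the two low words come from dst,
   the two high words come from src, each saturated. */
mm_words packssdw(mm_dwords dst, mm_dwords src)
{
    mm_words r = {{ saturate_to_word(dst.d[0]), saturate_to_word(dst.d[1]),
                    saturate_to_word(src.d[0]), saturate_to_word(src.d[1]) }};
    return r;
}
```

When both operands are the same register, the two packed results appear twice, which is why the slides follow the packing with punpckldq to gather the low halves of mm2 and mm3.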

20
Little endian: Y, Z, W, V
21
Memory Alignment
  • Memory operands for MMX should be aligned at 8-byte boundaries (unaligned MOVQ accesses work but are slower)
  • Use 16-byte boundaries for SSE2
  • .data
  • ALIGN 8
  • myBuf DWORD 128 DUP(?)
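
For comparison, the same alignment request can be expressed and checked in C. This is a sketch under the assumption of a C11 compiler; `my_buf` mirrors the MASM buffer above and `is_aligned` is a hypothetical helper.

```c
#include <stdint.h>
#include <stddef.h>

/* C11 counterpart of "ALIGN 8" followed by a 128-doubleword buffer. */
_Alignas(8) int32_t my_buf[128];

/* Returns nonzero when p is a multiple of alignment (in bytes). */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

A check such as `is_aligned(my_buf, 8)` is a cheap way to confirm a buffer's placement before passing it to MMX code; for SSE2 the same check would use 16.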

22
CPU-Mode Directives
  • In Irvine32.inc, the CPU mode is specified as
    .686P
  • MMX has been supported since the Pentium with MMX Technology
  • Additionally, you should specify .mmx to use MMX
    instructions
  • If you want to use SSE2, specify .xmm

23
Debugging with MMX
In the debugger, the MMX/SSE2 registers are hidden unless you explicitly enable their display
24
High-Resolution Counter
  • The PC timer ticks about 18.2 times every second
  • Low resolution (about 55 ms per tick)
  • Use the CPU internal clock counter for high
    accuracy performance measurement

25
High-Resolution Counter
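  • RDTSC
  • Reads the CPU time-stamp (cycle) counter
  • The counter increases by 1 every clock cycle
  • That is 3,000,000,000 counts every second on a 3 GHz CPU
  • The 64-bit result is placed in EDX:EAX (high half in EDX, low half in EAX)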

readTSC PROC
    rdtsc
    ret
readTSC ENDP
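
How the two register halves form one 64-bit count, and how a count relates to wall-clock time, can be sketched in C. The function names are hypothetical, and the conversion assumes the counter runs at a known, constant frequency such as the 3 GHz example on this slide.

```c
#include <stdint.h>

/* Combine the halves returned by RDTSC: EDX is the high 32 bits,
   EAX is the low 32 bits. */
uint64_t tsc_from_edx_eax(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Convert a cycle count to seconds, assuming a constant counter
   frequency hz (for example 3e9 for a 3 GHz CPU). */
double cycles_to_seconds(uint64_t cycles, double hz)
{
    return (double)cycles / hz;
}
```

At 3 GHz the low 32 bits alone would wrap in under two seconds, which is why both halves are needed for longer measurements.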
26
High-Resolution Counter
  • To calculate the time spent in a specific interval:
  • Record the start time and the finish time
  • Elapsed time = finish - start
  • Time stamps are 64 bits wide, but the SUB instruction handles operands of at most 32 bits
  • Use SUB on the low halves, then SBB (subtract with borrow) on the high halves
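
The SUB and SBB pair can be modeled in C on two 32-bit halves to see how the borrow propagates. This is a sketch of the arithmetic only; the `tsc_pair` type and `tsc_sub` name are hypothetical.

```c
#include <stdint.h>

/* A 64-bit time stamp held as two 32-bit halves, as in EDX:EAX. */
typedef struct { uint32_t lo, hi; } tsc_pair;

/* finish - start, computed the way SUB followed by SBB would do it. */
tsc_pair tsc_sub(tsc_pair finish, tsc_pair start)
{
    tsc_pair r;
    uint32_t borrow = finish.lo < start.lo;   /* carry flag set by SUB */
    r.lo = finish.lo - start.lo;              /* SUB on the low halves */
    r.hi = finish.hi - start.hi - borrow;     /* SBB on the high halves */
    return r;
}
```

For example, subtracting (hi = 1, lo = 10) from (hi = 2, lo = 5) borrows from the high half and gives hi = 0, lo = 0xFFFFFFFB.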

27
SSE2
  • SIMD instructions that extend MMX
  • Basically, SSE2 and MMX are the same, except:
  • Registers for SSE2 are 128 bits instead of 64 bits, named xmm0 to xmm7
  • Eight 16-bit integers fit in a single register
  • xmm8 to xmm15 are accessible only in 64-bit mode
  • Memory operations should be aligned at 16-byte boundaries
  • Use the .xmm directive to enable SSE2 in MASM
  • Use MOVDQA (aligned) or MOVDQU (unaligned) instead of MOVQ for 128-bit data movement

28
From MMX to SSE2
  • Change the multiplication of a 1×2 vector by a 2×4 sub-matrix
  • 1? To ??
  • The rest is almost the same!

29
Things you have to do
  • Understand the code of MULT4x2
  • Extend the logic to handle generic matrix
    multiplication
  • Understand alignment of memory operations
  • Remember to put an EMMS instruction at the end of your MMX code, before any floating-point instructions run
  • Not required if you are using SSE2
  • Implement 1) naïve 2) MMX-based 3) SSE2-based
    algorithms and measure their performance