MMX-accelerated Matrix Multiplication

Title: Traditional Matrix Multiply Algorithm
Author: balasai
Last modified by: Wendy Tsai
Created Date: 10/10/2006 6:08:09 AM
Document presentation format: PowerPoint presentation

Slides: 30

Provided by: bala80

1
MMX-accelerated Matrix Multiplication

Assembly Language System Software
National Chiao-Tung Univ.

2
Motivation

Pentium processors support SIMD instructions for vector operations
Multiple operations can be performed in parallel
In this lecture, we show how to accelerate matrix multiplication by using MMX instructions

3
Naïve Matrix Multiplication
4
Naïve Matrix Multiplication

int16 vect[Y_SIZE];
int16 matr[Y_SIZE][X_SIZE];
int16 result[X_SIZE];
int32 accum;
for (i = 0; i < X_SIZE; i++) {
    accum = 0;
    for (j = 0; j < Y_SIZE; j++)
        accum += vect[j] * matr[j][i];
    result[i] = accum;
}
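
The loop above can be written as a small C function. This is only a sketch: the name `naive_multiply`, the flat row-major layout of the matrix, and the plain narrowing to 16 bits on store are assumptions, not part of the slides.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: result[i] = sum over j of vect[j] * matr[j][i].
   matr is stored row-major as y_size lines of x_size elements each. */
void naive_multiply(const int16_t *vect, const int16_t *matr,
                    int16_t *result, size_t x_size, size_t y_size)
{
    for (size_t i = 0; i < x_size; i++) {
        int32_t accum = 0;                       /* 32-bit accumulator */
        for (size_t j = 0; j < y_size; j++)
            accum += (int32_t)vect[j] * matr[j * x_size + i];
        result[i] = (int16_t)accum;              /* narrowing store, no saturation */
    }
}
```

Note that the inner loop walks down a column of `matr`, so consecutive accesses are `x_size` elements apart in memory.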

5
MMX

A collection of
new SIMD instructions
new registers
mm0 to mm7, each 64 bits wide
MMX is primarily for integer vector operations

6
MMX™ registers

[Figure: an mmx register (64 bits) shown inside an 80-bit float register; a char a (8 bits); an int b (32 bits) made of bytes b1 to b4; memory at p and p+8 viewed as 64-bit groups of four 16-bit elements]

7
MMX instructions

movd / movq: Move Doubleword / Move Quadword
punpcklbw / punpcklwd / punpckldq: Unpack Low Data and Interleave (byte / word / doubleword)
punpckhwd: Unpack High Data and Interleave (word)

[Figure: interleave diagrams labeled LBW and HBW]

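
The word-interleave instructions can be modeled lane by lane in portable C. The sketch below is not MMX code; the type `words4` and the function names are hypothetical, and lane 0 stands for the lowest 16 bits of the register.

```c
#include <stdint.h>

/* A 64-bit MMX register modeled as four 16-bit lanes; w[0] is the lowest word. */
typedef struct { int16_t w[4]; } words4;

/* Model of "punpcklwd dst, src": interleave the two LOW words of each operand. */
words4 unpack_low_words(words4 dst, words4 src)
{
    words4 r = {{ dst.w[0], src.w[0], dst.w[1], src.w[1] }};
    return r;
}

/* Model of "punpckhwd dst, src": interleave the two HIGH words of each operand. */
words4 unpack_high_words(words4 dst, words4 src)
{
    words4 r = {{ dst.w[2], src.w[2], dst.w[3], src.w[3] }};
    return r;
}
```

With dst holding one matrix line and src the next, each adjacent pair of lanes in the result holds one column of a two-line block.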
8
MMX instructions

pmaddwd: Multiply and Add Packed Integers (word)
paddd: Add Packed Integers (doubleword)
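
As an illustration of what these two instructions compute, here is a scalar C model. The function names `pmaddwd_model` and `paddd_model` are hypothetical; the casts back from unsigned assume the usual two's-complement wraparound of mainstream compilers.

```c
#include <stdint.h>

/* Model of pmaddwd: multiply four pairs of signed words, then add adjacent
   products to give two 32-bit results. Lane 0 is the lowest lane. */
void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    /* Each product fits in 32 bits; the sum is formed in unsigned arithmetic
       so the single overflowing case (all inputs -32768) wraps around. */
    out[0] = (int32_t)((uint32_t)((int32_t)a[0] * b[0]) +
                       (uint32_t)((int32_t)a[1] * b[1]));
    out[1] = (int32_t)((uint32_t)((int32_t)a[2] * b[2]) +
                       (uint32_t)((int32_t)a[3] * b[3]));
}

/* Model of paddd: add two pairs of doublewords with wraparound. */
void paddd_model(int32_t acc[2], const int32_t x[2])
{
    acc[0] = (int32_t)((uint32_t)acc[0] + (uint32_t)x[0]);
    acc[1] = (int32_t)((uint32_t)acc[1] + (uint32_t)x[1]);
}
```

One pmaddwd therefore yields two dot products of length 2, which is why it suits a 1×2 vector multiplied by two columns at a time.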

9
MMX for Matrix Multiply

One matrix multiplication is divided into a series of multiplications of a 1×2 vector by a 2×4 sub-matrix

10
MMX for Matrix Multiply
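
[Figure: esi points to the current two elements of the input vector; edx addresses the current block of the matrix; each matrix line holds ecx elements]
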
11
MMX for Matrix Multiply

int16 vect[Y_SIZE];
int16 matr[Y_SIZE][X_SIZE];
int16 result[X_SIZE];
int32 accum[4];
for (i = 0; i < X_SIZE; i += 4) {
    accum = {0, 0, 0, 0};
    for (j = 0; j < Y_SIZE; j += 2)
        accum += MULT4x2(&vect[j], &matr[j][i]);
    result[i..i+3] = accum;
}
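
The blocked loop can be tried out in plain C before any MMX code is written. In this sketch `mult4x2_scalar` is a hypothetical scalar stand-in for MULT4x2, and the code assumes X_SIZE is a multiple of 4 and Y_SIZE a multiple of 2.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar stand-in for MULT4x2: adds (1x2 vector) * (2x4 block) into accum.
   line_len is the number of elements per matrix line. */
void mult4x2_scalar(const int16_t *vect, const int16_t *block,
                    size_t line_len, int32_t accum[4])
{
    for (int c = 0; c < 4; c++)
        accum[c] += (int32_t)vect[0] * block[c]
                  + (int32_t)vect[1] * block[line_len + c];
}

/* Blocked vector-matrix product; matr is row-major, y_size lines of x_size.
   Assumes x_size % 4 == 0 and y_size % 2 == 0. */
void blocked_multiply(const int16_t *vect, const int16_t *matr,
                      int16_t *result, size_t x_size, size_t y_size)
{
    for (size_t i = 0; i < x_size; i += 4) {
        int32_t accum[4] = {0, 0, 0, 0};
        for (size_t j = 0; j < y_size; j += 2)
            mult4x2_scalar(&vect[j], &matr[j * x_size + i], x_size, accum);
        for (int c = 0; c < 4; c++)
            result[i + c] = (int16_t)accum[c];   /* narrowing store, no saturation */
    }
}
```

A version like this is a convenient reference for checking the output of the MMX implementation.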

12
MMX code for MULT4x2

MULT4x2:
    movd      mm7, [esi]          ; Load two elements from input vector
    punpckldq mm7, mm7            ; Duplicate input vector: x0, x1, x0, x1
    movq      mm0, [edx+0]        ; Load first line of matrix (4 elements)
    movq      mm6, [edx+2*ecx]    ; Load second line of matrix (4 elements)
    movq      mm1, mm0            ; Transpose matrix to column presentation
    punpcklwd mm0, mm6            ; mm0 keeps columns 0 and 1
    punpckhwd mm1, mm6            ; mm1 keeps columns 2 and 3
    pmaddwd   mm0, mm7            ; Multiply and add the 1st and 2nd column
    pmaddwd   mm1, mm7            ; Multiply and add the 3rd and 4th column
    paddd     mm2, mm0            ; Accumulate 32-bit results for col. 0/1
    paddd     mm3, mm1            ; Accumulate 32-bit results for col. 2/3
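
To follow the register contents through this sequence, the steps can be mirrored lane by lane in C. This is a model for study, not a replacement for the assembly; the function name `mult4x2_lanes` and its parameter names are hypothetical, and lane 0 is the lowest word or doubleword of each register.

```c
#include <stdint.h>
#include <stddef.h>

/* Lane-by-lane model of one MULT4x2 step.
   vect     : two input-vector elements        (esi in the assembly)
   block    : first line of the 2x4 block      (edx)
   line_len : elements per matrix line         (ecx)
   acc01    : running sums for columns 0 and 1 (mm2)
   acc23    : running sums for columns 2 and 3 (mm3) */
void mult4x2_lanes(const int16_t *vect, const int16_t *block, size_t line_len,
                   int32_t acc01[2], int32_t acc23[2])
{
    const int16_t *row0 = block;              /* movq mm0, [edx+0]     */
    const int16_t *row1 = block + line_len;   /* movq mm6, [edx+2*ecx] */

    /* movd + punpckldq: mm7 = x0, x1, x0, x1 */
    int16_t mm7[4] = { vect[0], vect[1], vect[0], vect[1] };

    /* punpcklwd / punpckhwd: pair up the two lines column by column */
    int16_t mm0[4] = { row0[0], row1[0], row0[1], row1[1] };
    int16_t mm1[4] = { row0[2], row1[2], row0[3], row1[3] };

    /* pmaddwd followed by paddd */
    acc01[0] += (int32_t)mm0[0] * mm7[0] + (int32_t)mm0[1] * mm7[1];
    acc01[1] += (int32_t)mm0[2] * mm7[2] + (int32_t)mm0[3] * mm7[3];
    acc23[0] += (int32_t)mm1[0] * mm7[0] + (int32_t)mm1[1] * mm7[1];
    acc23[1] += (int32_t)mm1[2] * mm7[2] + (int32_t)mm1[3] * mm7[3];
}
```

The pointer arithmetic `block + line_len` corresponds to the byte offset `2*ecx`, since each element is 2 bytes.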

13
MMX code for MULT4x2

Matrix states in multiplication
movd mm7, [esi] ; Load two elements from input vector
punpckldq mm7, mm7 ; Duplicate input vector: x0, x1, x0, x1

14
MMX code for MULT4x2

movq mm0, [edx+0] ; Load first line of matrix
The 4x2 block is addressed through register edx
movq mm6, [edx+2*ecx] ; Load second line of matrix
ecx contains the number of elements per matrix line (each element is 2 bytes, hence the factor 2)

15
MMX code for MULT4x2

movq mm1, mm0 ; Transpose matrix to column presentation
punpcklwd mm0, mm6 ; mm0 keeps columns 0 and 1
punpckhwd mm1, mm6 ; mm1 keeps columns 2 and 3

16
MMX code for MULT4x2

pmaddwd mm0, mm7 ; Multiply and add the 1st and 2nd column
pmaddwd mm1, mm7 ; Multiply and add the 3rd and 4th column

17
MMX code for MULT4x2

paddd mm2, mm0 ; Accumulate 32-bit results for col. 0/1
paddd mm3, mm1 ; Accumulate 32-bit results for col. 2/3

18
MMX code for MULT4x2

Packing and storing results
packssdw mm2, mm2 ; Pack the results for columns 0 and 1 to 16 bits
packssdw mm3, mm3 ; Pack the results for columns 2 and 3 to 16 bits
punpckldq mm2, mm3 ; All four 16-bit results in one register (mm2)
movq [edi], mm2 ; Store four results into output vector

19
MMX code for MULT4x2

packssdw mm2,mm2
packssdw mm3,mm3
Convert (shrink) signed DWORDs into WORDs, with signed saturation
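
The packing step can be modeled in C as follows. This is a sketch with hypothetical names (`saturate_to_word`, `pack_results`); it shows the signed saturation that packssdw applies, so values outside the 16-bit range are clamped rather than truncated.

```c
#include <stdint.h>

/* Signed saturation of one doubleword to a word, as packssdw does per lane. */
int16_t saturate_to_word(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Model of: packssdw mm2,mm2 / packssdw mm3,mm3 / punpckldq mm2,mm3.
   acc01 and acc23 hold the 32-bit sums for columns 0/1 and 2/3;
   out receives the four 16-bit results in column order. */
void pack_results(const int32_t acc01[2], const int32_t acc23[2], int16_t out[4])
{
    out[0] = saturate_to_word(acc01[0]);
    out[1] = saturate_to_word(acc01[1]);
    out[2] = saturate_to_word(acc23[0]);
    out[3] = saturate_to_word(acc23[1]);
}
```

Because of the saturation, a reference implementation that simply truncates to 16 bits will disagree with the MMX result whenever a sum overflows.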

20
Little endian: Y, Z, W, V
21
Memory Alignment

Memory operands for MMX should be aligned on 8-byte boundaries (unaligned MOVQ accesses work but are slower)
16-byte boundaries for SSE2 (required by aligned moves such as MOVDQA)
.data
ALIGN 8
myBuf DWORD 128 DUP(?)
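
For comparison, the same alignment request can be expressed in C11. This is a sketch only: `my_buf` mirrors the MASM declaration above, and `is_aligned` is a hypothetical helper for checking an address.

```c
#include <stdalign.h>
#include <stdint.h>

/* C11 counterpart of "ALIGN 8" followed by "myBuf DWORD 128 DUP(?)". */
alignas(8) uint32_t my_buf[128];

/* Returns 1 if p lies on an n-byte boundary; n must be a power of two. */
int is_aligned(const void *p, uintptr_t n)
{
    return ((uintptr_t)p & (n - 1)) == 0;
}
```

A check such as `is_aligned(my_buf, 8)` is a cheap way to confirm a buffer is suitable before handing it to SIMD code.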

22
CPU-Mode Directives

In Irvine32.inc, the CPU mode is specified as
.686P
MMX has been supported since the Pentium MMX
Additionally, you should specify .mmx to use MMX
instructions
If you want to use SSE2, specify .xmm

23
Debugging with MMX

MMX/SSE2 registers are hidden in the debugger unless you explicitly choose to display them

24
High-Resolution Counter

The PC timer ticks about 18.2 times every second
Low resolution
Use the CPU internal clock counter for high-accuracy performance measurement

25
High-Resolution Counter

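RDTSC
Reads the CPU cycle counter (time-stamp counter)
Incremented by 1 every clock cycle
3,000,000,000 counts per second on a 3 GHz CPU
The 64-bit result is returned in EDX:EAX (high half in EDX, low half in EAX)

readTSC PROC
    rdtsc
    ret
readTSC ENDP
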
26
High-Resolution Counter

To calculate the time spent in a specific interval,
record the starting time and the finish time
Elapsed = finish - start
Time stamps are 64 bits wide, but the SUB instruction handles operands of up to 32 bits
Use SUB on the low halves, then SBB (subtract with borrow) on the high halves
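
The SUB/SBB pair can be illustrated in C by subtracting the two 32-bit halves separately and carrying the borrow by hand. The type `tsc_t` and the function `tsc_elapsed` are hypothetical names for this sketch.

```c
#include <stdint.h>

/* A 64-bit time stamp split the way RDTSC returns it: hi = EDX, lo = EAX. */
typedef struct { uint32_t lo, hi; } tsc_t;

/* finish - start computed from 32-bit pieces. */
tsc_t tsc_elapsed(tsc_t finish, tsc_t start)
{
    tsc_t d;
    uint32_t borrow = finish.lo < start.lo;   /* carry flag left by SUB   */
    d.lo = finish.lo - start.lo;              /* SUB on the low halves    */
    d.hi = finish.hi - start.hi - borrow;     /* SBB on the high halves   */
    return d;
}
```

In assembly the borrow travels in the carry flag, so the SBB must follow the SUB with no flag-modifying instruction in between.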

27
SSE2

SIMD instructions that extend MMX
Basically SSE2 and MMX are the same, except
Registers for SSE2 are 128 bits wide instead of 64 bits, named xmm0 to xmm7
Eight 16-bit integers fit in a single register
xmm8 to xmm15 are accessible only in 64-bit mode
Memory operations should be aligned on 16-byte boundaries
Use the .xmm directive to enable SSE2 in MASM
Use MOVDQA (aligned) or MOVDQU (unaligned) instead of MOVQ for data movement

28
From MMX to SSE2

Change the multiplication for 1×2 by 2×4 matrices
1? To ??
The rest are almost the same!

29
Things you have to do

Understand the code of MULT4x2
Extend the logic to handle generic matrix multiplication
Understand alignment of memory operations
Remember to put an EMMS instruction at the end of your program
Not required if you are using SSE2
Implement 1) naïve, 2) MMX-based, and 3) SSE2-based algorithms and measure their performance
