Title: P573 Scientific Computing Lecture 6 - Matrix Operations 2
P573 Scientific Computing
Lecture 6 - Matrix Operations 2
- Peter Gottschling
- pgottsch_at_cs.indiana.edu
- www.osl.iu.edu/pgottsch/courses/p573-06
Overview
- Definitions
- Addition
- Norms
- Matrix vector product
- Naïve matrix product
- Alternative implementations
- Blocking
Motivation: Matrix Product
- Where does matrix multiplication come from?
- Let's do some unscientific computing
- Imagine you have lists of objects and of how many components they consist of
- My car consists of a 3l engine, 4 doors, 4 side windows, ...
- Alex's car consists of a 1.5l engine, 2 doors, 2 side windows, ...
- We can write this as a matrix:

                3l engine   1.5l engine   Doors   Side windows
  Peter's car       1            0          4          4
  Alex's car        0            1          2          2
  Jacob's car      ...          ...        ...        ...
Motivation: Matrix Product (ii)
- Now we have a 2nd list
- How the components are composed of pieces
- The 3l engine contains 20 3/8 nuts, 12 5/8 bolts, 12 valves, ...
- One door contains 7 3/8 nuts, 5 5/8 bolts, 0 valves
- This leads us to a 2nd matrix
- Now the stunning question: how many 5/8 bolts does Alex's car contain?

                 3/8 nuts   5/8 bolts   Valves
  3l engine         20          12        12
  1.5l engine      ...         ...       ...
  Door               7           5         0
  Side window      ...         ...       ...
Motivation: Matrix Product (iii)
- This is easy to compute
- We multiply the number of components with the number of pieces
- Then we sum these products over all components
- Of course we can do this for all cars and all components
- So, we obtain a 3rd matrix
- Alex's car contains 38420 3/8 nuts, 7979034 5/8 bolts, 60 valves, ...
- Jacob's car contains ...
- Such things you learn in vocational school, not at university

                 3/8 nuts   5/8 bolts   Valves
  Peter's car      ...         ...       ...
  Alex's car      38420     7979034       60
  Jacob's car      ...         ...       ...
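- A minimal sketch of the car example as a matrix product C = A * B. Only the
3l-engine and door rows come from the slides; the 1.5l-engine, side-window
and Jacob rows are hypothetical filler values for illustration:

  #include <iostream>

  int main()
  {
      const int cars = 3, comps = 4, pieces = 3;
      // A: cars x components (3l engine, 1.5l engine, doors, side windows)
      double A[cars][comps] = {{1, 0, 4, 4},    // Peter (from the slides)
                               {0, 1, 2, 2},    // Alex (from the slides)
                               {0, 1, 4, 4}};   // Jacob: hypothetical
      // B: components x pieces (3/8 nuts, 5/8 bolts, valves)
      double B[comps][pieces] = {{20, 12, 12},  // 3l engine (from the slides)
                                 {16, 10,  8},  // 1.5l engine: hypothetical
                                 { 7,  5,  0},  // door (from the slides)
                                 { 4,  2,  0}}; // side window: hypothetical
      // C = A * B: cars x pieces
      double C[cars][pieces] = {};
      for (int i = 0; i < cars; i++)
          for (int k = 0; k < pieces; k++)
              for (int j = 0; j < comps; j++)
                  C[i][k] += A[i][j] * B[j][k];
      std::cout << "5/8 bolts in Alex's car: " << C[1][1] << '\n';
  }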
Definition: Matrix Product
- Mapping: $\mathbb{R}^{m \times n} \times \mathbb{R}^{n \times l} \to \mathbb{R}^{m \times l}$
- For matrices A = (a_ij), B = (b_jk), C = (c_ik)
- The product C = AB is defined by

  $c_{ik} = \sum_{j=1}^{n} a_{ij} \, b_{jk}$

[Figure: C (m x l) is the product of A (m x n) and B (n x l); entry c_ik
combines row i of A with column k of B]
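- A tiny worked instance of the definition (my example, not from the slides):

  $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
   \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
   = \begin{pmatrix} 1\cdot 5 + 2\cdot 7 & 1\cdot 6 + 2\cdot 8 \\
                     3\cdot 5 + 4\cdot 7 & 3\cdot 6 + 4\cdot 8 \end{pmatrix}
   = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$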
Operator Concatenation
- Recall matrices map from one vector space to another
- Say B maps from V1 to V2
- x ∈ V1 ⇒ Bx ∈ V2
- Say A maps from V2 to V3
- y ∈ V2 ⇒ Ay ∈ V3
- Then AB maps from V1 to V3
- x ∈ V1 ⇒ ABx ∈ V3
- Obviously dim(V2) = rows(B) = columns(A)
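- As a dimension check (restating the last bullet):

  $B \in \mathbb{R}^{\dim V_2 \times \dim V_1}, \quad
   A \in \mathbb{R}^{\dim V_3 \times \dim V_2} \;\Rightarrow\;
   AB \in \mathbb{R}^{\dim V_3 \times \dim V_1}$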
Programming Matrix Product
- Computing the dot product for all i and k

  // fill is std::fill from <algorithm>
  double *a = new double[M*N], *b = new double[N*L],
         *c = new double[M*L];
  fill(a+0, a+M*N, 2.0); fill(b+0, b+N*L, 3.0);
  fill(c+0, c+M*L, 0.0);
  for (unsigned i = 0; i < M; i++)
      for (unsigned k = 0; k < L; k++)
          for (unsigned j = 0; j < N; j++)
              c[i*L+k] += a[i*N+j] * b[j*L+k];

- Does the result change if we change the loops?
Changing Loop Order
- No, any order yields correct results, e.g. (k, j, i):

  fill(c+0, c+M*L, 0.0);
  for (unsigned k = 0; k < L; k++)
      for (unsigned j = 0; j < N; j++)
          for (unsigned i = 0; i < M; i++)
              c[i*L+k] += a[i*N+j] * b[j*L+k];

- For all k: C(:, k) = 0; then for all j: C(:, k) += A(:, j) · b_jk
- One column of C is incremented by a column of A scaled by an element of B
- This is called daxpy: double a·x plus y
- Or saxpy for single precision, or caxpy for complex
- Or gaxpy in general (generic?)
- Does this change our compute time?
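- A minimal sketch of the axpy kernel named above (the strides allow the
column accesses used in the loop, which have stride L for C and N for A in
the row-major arrays):

  // y += alpha * x for n elements -- the "daxpy" building block
  void daxpy(unsigned n, double alpha,
             const double* x, unsigned incx,
             double* y, unsigned incy)
  {
      for (unsigned i = 0; i < n; i++)
          y[i*incy] += alpha * x[i*incx];
  }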
Run Time Analysis
- Memory traffic
- A: m·n loads
- B: n·l loads
- C: m·l stores
- Operations
- Multiplications: m·n·l
- Additions: m·l·(n-1)
- For simplicity, we consider m = n = l
- 2n² loads
- n² stores
- 2n³ operations (approximately)
- Is this what we really do?
Run Time Analysis (ii)
- We can't store the whole matrix in the processor
- Maybe in cache?
- We have to reload values
- Even the values of C
- Unless we accumulate c_ik in the innermost loop (keeping it in a register)
- In reality we have
- 3n³ loads (instead of 2n²)
- n³ stores (instead of n²)
- 2n³ operations
- How many of them are from or to the memory?
- And how many from or to the cache?
- This depends on the loop order
n² Vector Operations
- All 6 variants are composed of n² vector operations of length n
- Innermost loop j: dot product, c_ik += A(i, :) · B(:, k)
- Innermost loop k: row axpy, C(i, :) += a_ij · B(j, :)
- Innermost loop i: column axpy, C(:, k) += A(:, j) · b_jk
Results for 500 by 500
> matrix_product 500
- (i, j, k): 2.78s, 89.9281 MFLOPS
- (i, k, j): 6.21s, 40.2576 MFLOPS
- (j, i, k): 3.18s, 78.6164 MFLOPS
- (j, k, i): 9.91s, 25.227 MFLOPS
- (k, i, j): 5.63s, 44.405 MFLOPS
- (k, j, i): 9.48s, 26.3713 MFLOPS
- Again on a PowerBook G4 1.33GHz
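- The MFLOPS numbers follow from the 2n³ operation count (a quick check
against the first timing above):

  $\text{MFLOPS} = \frac{2n^3}{t \cdot 10^6}
   = \frac{2 \cdot 500^3}{2.78 \cdot 10^6} \approx 89.9$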
Results for 512 by 512
> matrix_product 512
- (i, j, k): 2.87s, 93.5315 MFLOPS
- (i, k, j): 21.81s, 12.3079 MFLOPS
- (j, i, k): 3.46s, 77.5825 MFLOPS
- (j, k, i): 42.09s, 6.37765 MFLOPS
- (k, i, j): 22.51s, 11.9252 MFLOPS
- (k, j, i): 42.95s, 6.24995 MFLOPS
- Why is this so slow?
- Will explain this on the blackboard (hint: a row of 512 doubles is a
power-of-2 stride, so non-unit-stride accesses map to the same cache sets
and keep evicting each other)
Results for 544 by 544
> matrix_product 544
- (i, j, k): 3.57s, 90.19 MFLOPS
- (i, k, j): 8.4s, 38.3308 MFLOPS
- (j, i, k): 4.04s, 79.6976 MFLOPS
- (j, k, i): 13.61s, 23.6575 MFLOPS
- (k, i, j): 7.35s, 43.8066 MFLOPS
- (k, j, i): 13.2s, 24.3923 MFLOPS
- This behaves like the first one (500×500)
544 by 544 with Single Precision
> matrix_product 544
- (i, j, k): 2.17s, 148.377 MFLOPS
- (i, k, j): 4.28s, 75.2286 MFLOPS
- (j, i, k): 2.08s, 154.797 MFLOPS
- (j, k, i): 6.51s, 49.459 MFLOPS
- (k, i, j): 4.33s, 74.3599 MFLOPS
- (k, j, i): 6.36s, 50.6255 MFLOPS
- Twice as fast
- Because it takes half the time to compute, or half the time to load and
store?
17Row-Major Memory Layout
i
0 1 2
3 4
A
j
0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
0 1 2 3 4
j
0 1 2
3 4
B
k
0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
0 1 2 3 4
i
0 1 2
3 4
C
k
0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
0 1 2 3 4
- Loop order (i, j, k) has unit-stride access in the innermost loop, but
matrix B is still traversed completely by the two inner loops (see next
slide)
Matrix Reload
- Assume one matrix is larger than the cache
- With loop order (i, j, k):
- Matrix B is completely traversed in the 2 inner loops
- In the next iteration of the outer loop, B is reloaded into cache
- Same with loop order (i, k, j)
- With j as outer loop index we read C completely from main memory
- With k as outer loop index matrix A has to be reloaded
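- For scale, a quick calculation (assuming the PowerBook G4's 512 KB L2
cache):

  $544^2 \cdot 8\,\text{bytes} \approx 2.4\,\text{MB} \gg 512\,\text{KB}$

so each matrix is several times larger than the cache and the reloads above
happen on every outer iteration.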
Blocked Matrix Multiplication
- Assume we can partition A, B and C into blocks of size L×L
- Then we can express the multiplication in terms of matrix products of
blocks
- A = (A_ii,jj), B = (B_jj,kk), C = (C_ii,kk)

  $C_{ii,kk} = \sum_{jj} A_{ii,jj} \, B_{jj,kk}$

[Figure: block C_ii,kk is computed from block row ii of A and block column
kk of B]
What Do We Win?
- If the block size is small enough, we can keep a block of each matrix in
cache
- Computation within the block loads from and stores to cache
- At least after the first access to each element
- Within one block product we have
- 3L² loads from memory (at most)
- 3L³ loads from cache, which are fast
- L² stores into memory
- L³ stores into cache, also fast
- 2L³ operations
- If L is large enough, memory traffic is negligible!
- If L is too large, blocks do not fit into cache :-(
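- Summing over all $(n/L)^3$ block products gives (a short derivation, not
on the slide):

  $\text{memory loads} \approx 3\left(\frac{n}{L}\right)^3 L^2
   = \frac{3n^3}{L}, \qquad
   \frac{\text{operations}}{\text{memory loads}}
   \approx \frac{2n^3}{3n^3/L} = \frac{2L}{3}$

- So the number of operations per memory load grows linearly with the block
size L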
Program Blocked Matrix Product

  // Square N-by-N matrices, block size B, blocks = N/B
  // (e.g. N = 544, B = 32 gives blocks = 17)
  fill(c+0, c+N*N, 0.0);
  for (int ii = 0; ii < blocks; ii++)
      for (int jj = 0; jj < blocks; jj++)
          for (int kk = 0; kk < blocks; kk++)
              for (int i = ii*B, i_end = (ii+1)*B; i < i_end; i++)
                  for (int j = jj*B, j_end = (jj+1)*B; j < j_end; j++)
                      for (int k = kk*B, k_end = (kk+1)*B; k < k_end; k++)
                          c[i*N+k] += a[i*N+j] * b[j*N+k];

- What other improvements can we make?
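- A minimal timing harness for the blocked product (my sketch, not the
course's blocked_matrix_product program; assumes B divides N, as 544 = 17·32):

  #include <algorithm>
  #include <chrono>
  #include <iostream>

  void blocked_product(const double* a, const double* b, double* c,
                       int N, int B)
  {
      std::fill(c, c+N*N, 0.0);
      const int blocks = N / B;
      for (int ii = 0; ii < blocks; ii++)
          for (int jj = 0; jj < blocks; jj++)
              for (int kk = 0; kk < blocks; kk++)
                  for (int i = ii*B, i_end = (ii+1)*B; i < i_end; i++)
                      for (int j = jj*B, j_end = (jj+1)*B; j < j_end; j++)
                          for (int k = kk*B, k_end = (kk+1)*B; k < k_end; k++)
                              c[i*N+k] += a[i*N+j] * b[j*N+k];
  }

  int main()
  {
      const int N = 544, B = 32;
      double *a = new double[N*N], *b = new double[N*N], *c = new double[N*N];
      std::fill(a, a+N*N, 2.0); std::fill(b, b+N*N, 3.0);
      auto t0 = std::chrono::steady_clock::now();
      blocked_product(a, b, c, N, B);
      std::chrono::duration<double> t = std::chrono::steady_clock::now() - t0;
      std::cout << t.count() << "s, "
                << 2.0*N*N*N / t.count() / 1e6 << " MFLOPS\n";
      delete[] a; delete[] b; delete[] c;
  }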
Blocking of 544 by 544 Matrices
> blocked_matrix_product 544 32
- (i, j, k): 1.27s, 253.526 MFLOPS
- (i, k, j): 1.75s, 183.988 MFLOPS
- (j, i, k): 0.94s, 342.53 MFLOPS
- (j, k, i): 1.11s, 290.071 MFLOPS
- (k, i, j): 1.77s, 181.909 MFLOPS
- (k, j, i): 1.26s, 255.538 MFLOPS
- Is there more we can improve?
Unrolling Inner Loop

  for (unsigned ii = 0; ii < blocks; ii++)
      for (unsigned jj = 0; jj < blocks; jj++)
          for (unsigned kk = 0; kk < blocks; kk++)
              for (unsigned i = ii*B, i_end = (ii+1)*B; i < i_end; i++)
                  for (unsigned j = jj*B, j_end = (jj+1)*B; j < j_end; j++)
                      for (unsigned k = kk*B, k_end = (kk+1)*B; k < k_end; k += 4) {
                          unsigned index_a = i*N+j, index_b = j*N+k, index_c = i*N+k;
                          // 4 updates per iteration; a[index_a] stays in a register
                          c[index_c]   += a[index_a] * b[index_b];
                          c[index_c+1] += a[index_a] * b[index_b+1];
                          c[index_c+2] += a[index_a] * b[index_b+2];
                          c[index_c+3] += a[index_a] * b[index_b+3];
                      }
Unrolling Dot Product

  for (unsigned ii = 0; ii < blocks; ii++)
      for (unsigned kk = 0; kk < blocks; kk++)
          for (unsigned jj = 0; jj < blocks; jj++)
              for (unsigned i = ii*B, i_end = (ii+1)*B; i < i_end; i++)
                  for (unsigned k = kk*B, k_end = (kk+1)*B; k < k_end; k++) {
                      real_type tmp0 = 0.0, tmp1 = 0.0, tmp2 = 0.0, tmp3 = 0.0;
                      for (unsigned j = jj*B, j_end = (jj+1)*B; j < j_end; j += 4) {
                          unsigned index_a = i*N+j, index_b = j*N+k;
                          // 4 independent accumulators hide the FP add latency
                          tmp0 += a[index_a]   * b[index_b];
                          tmp1 += a[index_a+1] * b[index_b+N];
                          tmp2 += a[index_a+2] * b[index_b+2*N];
                          tmp3 += a[index_a+3] * b[index_b+3*N];
                      }
                      c[i*N+k] += tmp0 + tmp1 + tmp2 + tmp3;
                  }
Results of Optimization
- Unrolling of the scaled vector addition is already done by the compiler,
no speedup there
- Constant index computation didn't pay off either
- Unrolling the dot product accelerated the code
- About 630 MFLOPS ≈ 1 double operation per 2 cycles
- With single precision: 650 MFLOPS
- We can compute double precision and single precision at almost the same
speed
- Given the data are already in cache
- If you want to get faster, use optimized libraries
More Variants
- What if we consider 2 levels of cache?
- We need 9 loops
- With 3 levels of cache we need 12 loops, ...
- Dual concept: the nesting is not part of the algorithm but of the data
structure / memory layout
- Morton-ordered matrices
- Prof. Wise will give a lecture about it
- Unrolling
- Mostly done by the compiler
- Superscalar dot products
- There are very fast libraries for this
- Most popular is BLAS
- Often optimized by computer vendors
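- For comparison, a minimal sketch of delegating the product to BLAS through
the standard C interface (assumes a CBLAS implementation such as ATLAS is
installed and linked):

  #include <cblas.h>

  // C = 1.0 * A * B + 0.0 * C for row-major n-by-n matrices
  void gemm_via_blas(const double* a, const double* b, double* c, int n)
  {
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  n, n, n,       // dimensions m, n, k
                  1.0, a, n,     // alpha, A, leading dimension of A
                  b, n,          // B, leading dimension of B
                  0.0, c, n);    // beta, C, leading dimension of C
  }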
Why is the Matrix Product so Important?
- The Top 500 list of fastest computers is based on Linpack
- Contains linear algebra algorithms
- Most of the time is spent in matrix products
- If one can speed up the (dense) matrix product by 17%, one gets a 17%
better rating in the Top 500
- Interest in fast dense matrix products is more political than scientific
- What is more important?
- Compute something people need, see next lectures