Title: P573 Scientific Computing Lecture 6 - Matrix Operations 2
P573 Scientific Computing
Lecture 6 - Matrix Operations 2
- Peter Gottschling
- pgottsch_at_cs.indiana.edu
- www.osl.iu.edu/pgottsch/courses/p573-06
Overview
- Definitions
- Addition
- Norms
- Matrix vector product
- Naïve matrix product
- Alternative implementations
- Blocking
Motivation: Matrix Product
- Where does matrix multiplication come from?
- Let's do some unscientific computing
- Imagine you have lists of objects and of how many components they consist of
- My car consists of a 3l engine, 4 doors, 4 side windows, ...
- Alex's car consists of a 1.5l engine, 2 doors, 2 side windows, ...
- We can write this as a matrix:

                3l engine   1.5l engine   Doors   Side windows
  Peter's car       1            0          4          4
  Alex's car        0            1          2          2
  Jacob's car      ...          ...        ...        ...
Motivation: Matrix Product (ii)
- Now we have a 2nd list
- How the components are composed of pieces
- The 3l engine contains 20 3/8 nuts, 12 5/8 bolts, 12 valves, ...
- One door contains 7 3/8 nuts, 5 5/8 bolts, 0 valves
- This leads us to a 2nd matrix
- Now the stunning question: how many 5/8 bolts does Alex's car contain?

                 3/8 nuts   5/8 bolts   Valves
  3l engine         20          12        12
  1.5l engine      ...         ...       ...
  Door               7           5         0
  Side window      ...         ...       ...
Motivation: Matrix Product (iii)
- This is easy to compute
- We multiply the number of components with the number of pieces
- Then we sum these products over all components
- Of course we can do this for all cars and all components
- So, we obtain a 3rd matrix
- Alex's car contains 38420 3/8 nuts, 7979034 5/8 bolts, 60 valves, ...
- Jacob's car contains ...
- Such things you learn in vocational school, not at university

                 3/8 nuts   5/8 bolts   Valves
  Peter's car      ...         ...       ...
  Alex's car      38420     7979034       60
  Jacob's car      ...         ...       ...
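- A minimal sketch of the car example as a matrix product C = A * B. Only the
3l-engine and door rows come from the slides; the 1.5l-engine, side-window
and Jacob rows are hypothetical filler values for illustration:

  #include <iostream>

  int main()
  {
      const int cars = 3, comps = 4, pieces = 3;
      // A: cars x components (3l engine, 1.5l engine, doors, side windows)
      double A[cars][comps] = {{1, 0, 4, 4},    // Peter (from the slides)
                               {0, 1, 2, 2},    // Alex (from the slides)
                               {0, 1, 4, 4}};   // Jacob: hypothetical
      // B: components x pieces (3/8 nuts, 5/8 bolts, valves)
      double B[comps][pieces] = {{20, 12, 12},  // 3l engine (from the slides)
                                 {16, 10,  8},  // 1.5l engine: hypothetical
                                 { 7,  5,  0},  // door (from the slides)
                                 { 4,  2,  0}}; // side window: hypothetical
      // C = A * B: cars x pieces
      double C[cars][pieces] = {};
      for (int i = 0; i < cars; i++)
          for (int k = 0; k < pieces; k++)
              for (int j = 0; j < comps; j++)
                  C[i][k] += A[i][j] * B[j][k];
      std::cout << "5/8 bolts in Alex's car: " << C[1][1] << '\n';
  }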
Definition: Matrix Product
- Mapping: $\mathbb{R}^{m \times n} \times \mathbb{R}^{n \times l} \to \mathbb{R}^{m \times l}$
- For matrices A = (a_ij), B = (b_jk), C = (c_ik)
- The product C = AB is defined by

  $c_{ik} = \sum_{j=1}^{n} a_{ij} \, b_{jk}$

[Figure: C (m x l) is the product of A (m x n) and B (n x l); entry c_ik
combines row i of A with column k of B]
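- A tiny worked instance of the definition (my example, not from the slides):

  $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
   \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
   = \begin{pmatrix} 1\cdot 5 + 2\cdot 7 & 1\cdot 6 + 2\cdot 8 \\
                     3\cdot 5 + 4\cdot 7 & 3\cdot 6 + 4\cdot 8 \end{pmatrix}
   = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$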
Operator Concatenation
- Recall matrices map from one vector space to another
- Say B maps from V1 to V2
- x ∈ V1 ⇒ Bx ∈ V2
- Say A maps from V2 to V3
- y ∈ V2 ⇒ Ay ∈ V3
- Then AB maps from V1 to V3
- x ∈ V1 ⇒ ABx ∈ V3
- Obviously dim(V2) = rows(B) = columns(A)
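- As a dimension check (restating the last bullet):

  $B \in \mathbb{R}^{\dim V_2 \times \dim V_1}, \quad
   A \in \mathbb{R}^{\dim V_3 \times \dim V_2} \;\Rightarrow\;
   AB \in \mathbb{R}^{\dim V_3 \times \dim V_1}$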
Programming Matrix Product
- Computing the dot product for all i and k

  // fill is std::fill from <algorithm>
  double *a = new double[M*N], *b = new double[N*L],
         *c = new double[M*L];
  fill(a+0, a+M*N, 2.0); fill(b+0, b+N*L, 3.0);
  fill(c+0, c+M*L, 0.0);
  for (unsigned i = 0; i < M; i++)
      for (unsigned k = 0; k < L; k++)
          for (unsigned j = 0; j < N; j++)
              c[i*L+k] += a[i*N+j] * b[j*L+k];

- Does the result change if we change the loops?
Changing Loop Order
- No, any order yields correct results, e.g. (k, j, i):

  fill(c+0, c+M*L, 0.0);
  for (unsigned k = 0; k < L; k++)
      for (unsigned j = 0; j < N; j++)
          for (unsigned i = 0; i < M; i++)
              c[i*L+k] += a[i*N+j] * b[j*L+k];

- For all k: C(:, k) = 0; then for all j: C(:, k) += A(:, j) · b_jk
- One column of C is incremented by a column of A scaled by an element of B
- This is called daxpy: double a·x plus y
- Or saxpy for single precision, or caxpy for complex
- Or gaxpy in general (generic?)
- Does this change our compute time?
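- A minimal sketch of the axpy kernel named above (the strides allow the
column accesses used in the loop, which have stride L for C and N for A in
the row-major arrays):

  // y += alpha * x for n elements -- the "daxpy" building block
  void daxpy(unsigned n, double alpha,
             const double* x, unsigned incx,
             double* y, unsigned incy)
  {
      for (unsigned i = 0; i < n; i++)
          y[i*incy] += alpha * x[i*incx];
  }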
Run Time Analysis
- Memory traffic
- A: m·n loads
- B: n·l loads
- C: m·l stores
- Operations
- Multiplications: m·n·l
- Additions: m·l·(n-1)
- For simplicity, we consider m = n = l
- 2n² loads
- n² stores
- 2n³ operations (approximately)
- Is this what we really do?
Run Time Analysis (ii)
- We can't store the whole matrix in the processor
- Maybe in cache?
- We have to reload values
- Even the values of C
- Unless we accumulate c_ik in the innermost loop (keeping it in a register)
- In reality we have
- 3n³ loads (instead of 2n²)
- n³ stores (instead of n²)
- 2n³ operations
- How many of them are from or to the memory?
- And how many from or to the cache?
- This depends on the loop order
n² Vector Operations
- All 6 variants are composed of n² vector operations of length n
- Innermost loop j: dot product, c_ik += A(i, :) · B(:, k)
- Innermost loop k: row axpy, C(i, :) += a_ij · B(j, :)
- Innermost loop i: column axpy, C(:, k) += A(:, j) · b_jk
Results for 500 by 500
> matrix_product 500
- (i, j, k): 2.78s, 89.9281 MFLOPS
- (i, k, j): 6.21s, 40.2576 MFLOPS
- (j, i, k): 3.18s, 78.6164 MFLOPS
- (j, k, i): 9.91s, 25.227 MFLOPS
- (k, i, j): 5.63s, 44.405 MFLOPS
- (k, j, i): 9.48s, 26.3713 MFLOPS
- Again on a PowerBook G4 1.33GHz
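- The MFLOPS numbers follow from the 2n³ operation count (a quick check
against the first timing above):

  $\text{MFLOPS} = \frac{2n^3}{t \cdot 10^6}
   = \frac{2 \cdot 500^3}{2.78 \cdot 10^6} \approx 89.9$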
Results for 512 by 512
> matrix_product 512
- (i, j, k): 2.87s, 93.5315 MFLOPS
- (i, k, j): 21.81s, 12.3079 MFLOPS
- (j, i, k): 3.46s, 77.5825 MFLOPS
- (j, k, i): 42.09s, 6.37765 MFLOPS
- (k, i, j): 22.51s, 11.9252 MFLOPS
- (k, j, i): 42.95s, 6.24995 MFLOPS
- Why is this so slow?
- Will explain this on the blackboard (hint: a row of 512 doubles is a
power-of-2 stride, so non-unit-stride accesses map to the same cache sets
and keep evicting each other)
Results for 544 by 544
> matrix_product 544
- (i, j, k): 3.57s, 90.19 MFLOPS
- (i, k, j): 8.4s, 38.3308 MFLOPS
- (j, i, k): 4.04s, 79.6976 MFLOPS
- (j, k, i): 13.61s, 23.6575 MFLOPS
- (k, i, j): 7.35s, 43.8066 MFLOPS
- (k, j, i): 13.2s, 24.3923 MFLOPS
- This behaves like the first one (500×500)
544 by 544 with Single Precision
> matrix_product 544
- (i, j, k): 2.17s, 148.377 MFLOPS
- (i, k, j): 4.28s, 75.2286 MFLOPS
- (j, i, k): 2.08s, 154.797 MFLOPS
- (j, k, i): 6.51s, 49.459 MFLOPS
- (k, i, j): 4.33s, 74.3599 MFLOPS
- (k, j, i): 6.36s, 50.6255 MFLOPS
- Twice as fast
- Because it takes half the time to compute, or half the time to load and
store?
17Row-Major Memory Layout
i
0 1 2
3 4
A
j
0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
0 1 2 3 4
j
0 1 2
3 4
B
k
0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
0 1 2 3 4
i
0 1 2
3 4
C
k
0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
0 1 2 3 4
- Loop order (i, j, k) has unit-stride access in the innermost loop, but
matrix B is still traversed completely by the two inner loops (see next
slide)
Matrix Reload
- Assume one matrix is larger than the cache
- With loop order (i, j, k):
- Matrix B is completely traversed in the 2 inner loops
- In the next iteration of the outer loop, B is reloaded into cache
- Same with loop order (i, k, j)
- With j as outer loop index we read C completely from main memory
- With k as outer loop index matrix A has to be reloaded
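- For scale, a quick calculation (assuming the PowerBook G4's 512 KB L2
cache):

  $544^2 \cdot 8\,\text{bytes} \approx 2.4\,\text{MB} \gg 512\,\text{KB}$

so each matrix is several times larger than the cache and the reloads above
happen on every outer iteration.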
Blocked Matrix Multiplication
- Assume we can partition A, B and C into blocks of size L×L
- Then we can express the multiplication in terms of matrix products of
blocks
- A = (A_ii,jj), B = (B_jj,kk), C = (C_ii,kk)

  $C_{ii,kk} = \sum_{jj} A_{ii,jj} \, B_{jj,kk}$

[Figure: block C_ii,kk is computed from block row ii of A and block column
kk of B]
What Do We Win?
- If the block size is small enough, we can keep a block of each matrix in
cache
- Computation within the block loads from and stores to cache
- At least after the first access to each element
- Within one block product we have
- 3L² loads from memory (at most)
- 3L³ loads from cache, which are fast
- L² stores into memory
- L³ stores into cache, also fast
- 2L³ operations
- If L is large enough, memory traffic is negligible!
- If L is too large, blocks do not fit into cache :-(
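- Summing over all $(n/L)^3$ block products gives (a short derivation, not
on the slide):

  $\text{memory loads} \approx 3\left(\frac{n}{L}\right)^3 L^2
   = \frac{3n^3}{L}, \qquad
   \frac{\text{operations}}{\text{memory loads}}
   \approx \frac{2n^3}{3n^3/L} = \frac{2L}{3}$

- So the number of operations per memory load grows linearly with the block
size L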
Program Blocked Matrix Product

  // Square N-by-N matrices, block size B, blocks = N/B
  // (e.g. N = 544, B = 32 gives blocks = 17)
  fill(c+0, c+N*N, 0.0);
  for (int ii = 0; ii < blocks; ii++)
      for (int jj = 0; jj < blocks; jj++)
          for (int kk = 0; kk < blocks; kk++)
              for (int i = ii*B, i_end = (ii+1)*B; i < i_end; i++)
                  for (int j = jj*B, j_end = (jj+1)*B; j < j_end; j++)
                      for (int k = kk*B, k_end = (kk+1)*B; k < k_end; k++)
                          c[i*N+k] += a[i*N+j] * b[j*N+k];

- What other improvements can we make?
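- A minimal timing harness for the blocked product (my sketch, not the
course's blocked_matrix_product program; assumes B divides N, as 544 = 17·32):

  #include <algorithm>
  #include <chrono>
  #include <iostream>

  void blocked_product(const double* a, const double* b, double* c,
                       int N, int B)
  {
      std::fill(c, c+N*N, 0.0);
      const int blocks = N / B;
      for (int ii = 0; ii < blocks; ii++)
          for (int jj = 0; jj < blocks; jj++)
              for (int kk = 0; kk < blocks; kk++)
                  for (int i = ii*B, i_end = (ii+1)*B; i < i_end; i++)
                      for (int j = jj*B, j_end = (jj+1)*B; j < j_end; j++)
                          for (int k = kk*B, k_end = (kk+1)*B; k < k_end; k++)
                              c[i*N+k] += a[i*N+j] * b[j*N+k];
  }

  int main()
  {
      const int N = 544, B = 32;
      double *a = new double[N*N], *b = new double[N*N], *c = new double[N*N];
      std::fill(a, a+N*N, 2.0); std::fill(b, b+N*N, 3.0);
      auto t0 = std::chrono::steady_clock::now();
      blocked_product(a, b, c, N, B);
      std::chrono::duration<double> t = std::chrono::steady_clock::now() - t0;
      std::cout << t.count() << "s, "
                << 2.0*N*N*N / t.count() / 1e6 << " MFLOPS\n";
      delete[] a; delete[] b; delete[] c;
  }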
Blocking of 544 by 544 Matrices
> blocked_matrix_product 544 32
- (i, j, k): 1.27s, 253.526 MFLOPS
- (i, k, j): 1.75s, 183.988 MFLOPS
- (j, i, k): 0.94s, 342.53 MFLOPS
- (j, k, i): 1.11s, 290.071 MFLOPS
- (k, i, j): 1.77s, 181.909 MFLOPS
- (k, j, i): 1.26s, 255.538 MFLOPS
- Is there more we can improve?
Unrolling Inner Loop

  for (unsigned ii = 0; ii < blocks; ii++)
      for (unsigned jj = 0; jj < blocks; jj++)
          for (unsigned kk = 0; kk < blocks; kk++)
              for (unsigned i = ii*B, i_end = (ii+1)*B; i < i_end; i++)
                  for (unsigned j = jj*B, j_end = (jj+1)*B; j < j_end; j++)
                      for (unsigned k = kk*B, k_end = (kk+1)*B; k < k_end; k += 4) {
                          unsigned index_a = i*N+j, index_b = j*N+k, index_c = i*N+k;
                          // 4 updates per iteration; a[index_a] stays in a register
                          c[index_c]   += a[index_a] * b[index_b];
                          c[index_c+1] += a[index_a] * b[index_b+1];
                          c[index_c+2] += a[index_a] * b[index_b+2];
                          c[index_c+3] += a[index_a] * b[index_b+3];
                      }
Unrolling Dot Product

  for (unsigned ii = 0; ii < blocks; ii++)
      for (unsigned kk = 0; kk < blocks; kk++)
          for (unsigned jj = 0; jj < blocks; jj++)
              for (unsigned i = ii*B, i_end = (ii+1)*B; i < i_end; i++)
                  for (unsigned k = kk*B, k_end = (kk+1)*B; k < k_end; k++) {
                      real_type tmp0 = 0.0, tmp1 = 0.0, tmp2 = 0.0, tmp3 = 0.0;
                      for (unsigned j = jj*B, j_end = (jj+1)*B; j < j_end; j += 4) {
                          unsigned index_a = i*N+j, index_b = j*N+k;
                          // 4 independent accumulators hide the FP add latency
                          tmp0 += a[index_a]   * b[index_b];
                          tmp1 += a[index_a+1] * b[index_b+N];
                          tmp2 += a[index_a+2] * b[index_b+2*N];
                          tmp3 += a[index_a+3] * b[index_b+3*N];
                      }
                      c[i*N+k] += tmp0 + tmp1 + tmp2 + tmp3;
                  }
Results of Optimization
- Unrolling of the scaled vector addition is already done by the compiler,
no speedup there
- Constant index computation didn't pay off either
- Unrolling the dot product accelerated the code
- About 630 MFLOPS ≈ 1 double operation per 2 cycles
- With single precision: 650 MFLOPS
- We can compute double precision and single precision at almost the same
speed
- Given the data are already in cache
- If you want to get faster, use optimized libraries
More Variants
- What if we consider 2 levels of cache?
- We need 9 loops
- With 3 levels of cache we need 12 loops, ...
- Dual concept: the nesting is not part of the algorithm but of the data
structure / memory layout
- Morton-ordered matrices
- Prof. Wise will give a lecture about it
- Unrolling
- Mostly done by the compiler
- Superscalar dot products
- There are very fast libraries for this
- Most popular is BLAS
- Often optimized by computer vendors
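- For comparison, a minimal sketch of delegating the product to BLAS through
the standard C interface (assumes a CBLAS implementation such as ATLAS is
installed and linked):

  #include <cblas.h>

  // C = 1.0 * A * B + 0.0 * C for row-major n-by-n matrices
  void gemm_via_blas(const double* a, const double* b, double* c, int n)
  {
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  n, n, n,       // dimensions m, n, k
                  1.0, a, n,     // alpha, A, leading dimension of A
                  b, n,          // B, leading dimension of B
                  0.0, c, n);    // beta, C, leading dimension of C
  }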
Why is the Matrix Product so Important?
- The Top 500 list of fastest computers is based on Linpack
- Contains linear algebra algorithms
- Most of the time is spent in matrix products
- If one can speed up the (dense) matrix product by 17%, one gets a 17%
better rating in the Top 500
- Interest in fast dense matrix products is more political than scientific
- What is more important?
- Compute something people need, see next lectures