P573 Scientific Computing Lecture 6 - Matrix Operations 2
1
P573 Scientific Computing
Lecture 6 - Matrix Operations 2
  • Peter Gottschling
  • pgottsch@cs.indiana.edu
  • www.osl.iu.edu/pgottsch/courses/p573-06

2
Overview
  • Definitions
  • Addition
  • Norms
  • Matrix vector product
  • Naïve matrix product
  • Alternative implementations
  • Blocking

3
Motivation: Matrix Product
  • Where does matrix multiplication come from?
  • Let's do some unscientific computing
  • Imagine you have lists of objects and of how many
    components they consist of
  • My car consists of a 3l engine, 4 doors, 4 side
    windows, …
  • Alex's car consists of a 15l engine, 2 doors, 2
    side windows, …
  • We can write this as a matrix

(Matrix: rows Peter's car, Alex's car, Jacob's car; columns 3l engine, 15l engine, doors, side windows)
4
Motivation: Matrix Product (ii)
  • Now we have a 2nd list
  • It says how the components are composed of pieces
  • The 3l engine contains 20 3/8" nuts, 12 5/8"
    bolts, 12 valves, …
  • One door contains 7 3/8" nuts, 5 5/8" bolts, 0
    valves
  • This leads us to a 2nd matrix
  • Now the stunning question: how many 5/8" bolts
    does Alex's car contain?

(Matrix: rows 3l engine, 15l engine, door, side window; columns 3/8" nuts, 5/8" bolts, valves)
5
Motivation: Matrix Product (iii)
  • This is easy to compute
  • We multiply the number of components by the
    number of pieces
  • Then we sum these products over all components
  • Of course we can do this for all cars and all
    components
  • So, we obtain a 3rd matrix
  • Alex's car contains 38420 3/8" nuts, 7979034 5/8"
    bolts, 60 valves, …
  • Jacob's car contains …
  • Such things you learn in vocational school, not
    at university

(Matrix: rows Peter's car, Alex's car, Jacob's car; columns 3/8" nuts, 5/8" bolts, valves)
6
Definition: Matrix Product
  • Mapping: ℝ^(m×n) × ℝ^(n×l) → ℝ^(m×l)
  • For matrices A = (a_ij), B = (b_jk), C = (c_ik)
  • The product C = AB is defined by c_ik = Σ_j a_ij · b_jk

(Figure: A is m×n, B is n×l, C is m×l; entry c_ik combines row i of A with column k of B)
7
Operator Concatenation
  • Recall: matrices map from one vector space to
    another
  • Say B maps from V1 to V2
  • x ∈ V1 ⇒ Bx ∈ V2
  • Say A maps from V2 to V3
  • y ∈ V2 ⇒ Ay ∈ V3
  • Then AB maps from V1 to V3
  • x ∈ V1 ⇒ ABx ∈ V3
  • Obviously dim(V2) = rows(B) = columns(A)

8
Programming Matrix Product
  • Computing dot product for all i and k
    double *a = new double[M*N], *b = new double[N*L],
           *c = new double[M*L];
    fill(a+0, a+M*N, 2.0); fill(b+0, b+N*L, 3.0);
    fill(c+0, c+M*L, 0.0);
    for (unsigned i = 0; i < M; ++i)
      for (unsigned k = 0; k < L; ++k)
        for (unsigned j = 0; j < N; ++j)
          c[i*L+k] += a[i*N+j] * b[j*L+k];
  • Does the result change if we change the loops?

9
Changing Loop Order
  • No, any order yields correct results, e.g. (k, j, i):
    fill(c+0, c+M*L, 0.0);
    for (unsigned k = 0; k < L; ++k)
      for (unsigned j = 0; j < N; ++j)
        for (unsigned i = 0; i < M; ++i)
          c[i*L+k] += a[i*N+j] * b[j*L+k];
  • ∀k: C(:,k) = 0; then ∀j: C(:,k) += A(:,j) · b_jk
  • One column of C is incremented by a column of A
    scaled by an element of B
  • This is called daxpy: double a·x plus y
  • Or saxpy for single precision, or caxpy for
    complex
  • Or gaxpy in general (generic?)
  • Does this change our compute time?

10
Run Time Analysis
  • Memory traffic
  • A: m·n loads
  • B: n·l loads
  • C: m·l stores
  • Operations
  • Multiplications: m·n·l
  • Additions: m·l·(n-1)
  • For simplicity, we consider m = n = l
  • 2n² loads
  • n² stores
  • 2n³ operations (approximately)
  • Is this what we really do?

11
Run Time Analysis
  • We can't store the whole matrix in the processor
  • Maybe in cache?
  • We have to reload values
  • Even the values of C
  • Unless we compute c_ik in the innermost loop
  • In reality we have
  • 3n³ loads (instead of 2n²)
  • n³ stores (instead of n²)
  • 2n³ operations
  • How many of them are from or to the memory?
  • And how many from or to the cache?
  • This depends on loop order

12
n² Vector Operations
  • All 6 variants are composed of vector operations

13
Results for 500 by 500
  • > matrix_product 500
  • (i, j, k): 2.78s, 89.9281 MFLOPS
  • (i, k, j): 6.21s, 40.2576 MFLOPS
  • (j, i, k): 3.18s, 78.6164 MFLOPS
  • (j, k, i): 9.91s, 25.227 MFLOPS
  • (k, i, j): 5.63s, 44.405 MFLOPS
  • (k, j, i): 9.48s, 26.3713 MFLOPS
  • Again on a PowerBook G4 1.33GHz

14
Results for 512 by 512
  • > matrix_product 512
  • (i, j, k): 2.87s, 93.5315 MFLOPS
  • (i, k, j): 21.81s, 12.3079 MFLOPS
  • (j, i, k): 3.46s, 77.5825 MFLOPS
  • (j, k, i): 42.09s, 6.37765 MFLOPS
  • (k, i, j): 22.51s, 11.9252 MFLOPS
  • (k, j, i): 42.95s, 6.24995 MFLOPS
  • Why is this so slow?
  • Will explain this on the blackboard

15
Results for 544 by 544
  • > matrix_product 544
  • (i, j, k): 3.57s, 90.19 MFLOPS
  • (i, k, j): 8.4s, 38.3308 MFLOPS
  • (j, i, k): 4.04s, 79.6976 MFLOPS
  • (j, k, i): 13.61s, 23.6575 MFLOPS
  • (k, i, j): 7.35s, 43.8066 MFLOPS
  • (k, j, i): 13.2s, 24.3923 MFLOPS
  • This behaves like the first one (500×500)

16
544 by 544 with Single Precision
  • > matrix_product 544
  • (i, j, k): 2.17s, 148.377 MFLOPS
  • (i, k, j): 4.28s, 75.2286 MFLOPS
  • (j, i, k): 2.08s, 154.797 MFLOPS
  • (j, k, i): 6.51s, 49.459 MFLOPS
  • (k, i, j): 4.33s, 74.3599 MFLOPS
  • (k, j, i): 6.36s, 50.6255 MFLOPS
  • Twice as fast
  • Because it takes half the time to compute, or
    half the time to load and store?

17
Row-Major Memory Layout
(Figure: row-major layout of the 5×5 matrices A, B, and C; for A the row index i selects a contiguous block of memory and the column index j runs through it with unit stride, likewise (j, k) for B and (i, k) for C)
  • Loop order (i, j, k) has unit-stride access, but …

18
Matrix Reload
  • Assume one matrix is larger than the cache
  • With loop order (i, j, k)
  • Matrix B is completely traversed in the 2 inner
    loops
  • In the next iteration of the outer loop, B is
    reloaded into cache
  • Same with loop order (i, k, j)
  • With j as outer loop index we read C completely
    from main memory
  • With k as outer loop index, matrix A has to be
    reloaded

19
Blocked Matrix Multiplication
  • Assume we can partition A, B and C into blocks of
    size L×L
  • Then we can express the multiplication in terms
    of products of blocks
  • A = (A_ii,jj), B = (B_jj,kk), C = (C_ii,kk),
    with C_ii,kk = Σ_jj A_ii,jj · B_jj,kk

(Figure: block partitioning of A, B, C; block C_ii,kk combines block row ii of A with block column kk of B)
20
What Do We Win?
  • If the block size is small enough, we can keep a
    block of each matrix in cache
  • Computation within the block loads from and
    stores to cache
  • After the first access to each element, at least
  • Within the block we have
  • 3L² loads from memory (at most)
  • 3L³ loads from cache, which are fast
  • L² stores into memory
  • L³ stores into cache, also fast
  • 2L³ operations
  • If L is large enough, memory traffic is
    negligible!
  • If L is too large, blocks do not fit into cache
    :-(

21
Program Blocked Matrix Product
    fill(c+0, c+N*N, 0.0);
    for (int ii = 0; ii < blocks; ++ii)
      for (int jj = 0; jj < blocks; ++jj)
        for (int kk = 0; kk < blocks; ++kk)
          for (int i = ii*B, i_end = (ii+1)*B; i < i_end; ++i)
            for (int j = jj*B, j_end = (jj+1)*B; j < j_end; ++j)
              for (int k = kk*B, k_end = (kk+1)*B; k < k_end; ++k)
                c[i*N+k] += a[i*N+j] * b[j*N+k];
  • What other improvements can we make?

22
Blocking of 544 by 544 Matrices
  • > blocked_matrix_product 544 32
  • (i, j, k): 1.27s, 253.526 MFLOPS
  • (i, k, j): 1.75s, 183.988 MFLOPS
  • (j, i, k): 0.94s, 342.53 MFLOPS
  • (j, k, i): 1.11s, 290.071 MFLOPS
  • (k, i, j): 1.77s, 181.909 MFLOPS
  • (k, j, i): 1.26s, 255.538 MFLOPS
  • Is there more we can improve?

23
Unrolling Inner Loop
    for (unsigned ii = 0; ii < blocks; ++ii)
      for (unsigned jj = 0; jj < blocks; ++jj)
        for (unsigned kk = 0; kk < blocks; ++kk)
          for (unsigned i = ii*B, i_end = (ii+1)*B; i < i_end; ++i)
            for (unsigned j = jj*B, j_end = (jj+1)*B; j < j_end; ++j)
              for (unsigned k = kk*B, k_end = (kk+1)*B; k < k_end; k += 4) {
                unsigned index_a = i*N+j, index_b = j*N+k, index_c = i*N+k;
                c[index_c]   += a[index_a] * b[index_b];
                c[index_c+1] += a[index_a] * b[index_b+1];
                c[index_c+2] += a[index_a] * b[index_b+2];
                c[index_c+3] += a[index_a] * b[index_b+3];
              }

24
Unrolling Dot Product
    for (unsigned ii = 0; ii < blocks; ++ii)
      for (unsigned kk = 0; kk < blocks; ++kk)
        for (unsigned jj = 0; jj < blocks; ++jj)
          for (unsigned i = ii*B, i_end = (ii+1)*B; i < i_end; ++i)
            for (unsigned k = kk*B, k_end = (kk+1)*B; k < k_end; ++k) {
              real_type tmp0 = 0.0, tmp1 = 0.0, tmp2 = 0.0, tmp3 = 0.0;
              for (unsigned j = jj*B, j_end = (jj+1)*B; j < j_end; j += 4) {
                unsigned index_a = i*N+j, index_b = j*N+k;
                tmp0 += a[index_a]   * b[index_b];
                tmp1 += a[index_a+1] * b[index_b+N];
                tmp2 += a[index_a+2] * b[index_b+2*N];
                tmp3 += a[index_a+3] * b[index_b+3*N];
              }
              c[i*N+k] += tmp0 + tmp1 + tmp2 + tmp3;
            }

25
Results of Optimization
  • Unrolling of the scaled vector addition is already
    done by the compiler, no speedup here
  • Constant index computation didn't pay off either
  • Unrolling the dot product accelerated
  • About 630 MFLOPS = 1 double operation per 2 cycles
  • With single precision: 650 MFLOPS
  • We can compute double precision and single
    precision at almost the same speed
  • Given they are already in cache
  • If you want to get faster, use optimized libraries

26
More Variants
  • What if we consider 2 levels of cache?
  • We need 9 loops
  • With 3 levels of cache we need 12 loops, …
  • Dual concept: the nesting is not part of the
    algorithm but of the data structure / memory
    layout
  • Morton-ordered matrices
  • Prof. Wise will give a lecture about it
  • Unrolling
  • Mostly done by the compiler
  • Superscalar dot products
  • There are very fast libraries for this
  • Most popular is BLAS
  • Often optimized by computer vendors

27
Why is Matrix Product so Important?
  • The Top 500 list of the fastest computers is based
    on Linpack
  • It contains linear algebra algorithms
  • Most of the time is spent in matrix products
  • If one can speed up the (dense) matrix product by
    17%, one gets a 17% better rating in the Top 500
  • Interest in fast dense matrix products is more
    political than scientific
  • What is more important?
  • Compute something people need, see next lectures