Title: MA471 Fall 2002, Lecture 8
1MA471Fall 2002, Lecture 8
- Some efficiency related tricks
2Data Point Storage
In Matlab and Fortran
For efficiency reasons we will store the data in
a matrixwith element nodes running along matrix
columns
Elements
1
2
Data Values
3
................................
4
.
.
.
.
.
.
.
.
.
3Data Point Storage
Data Values
1
2
3
In C or C
.
.
For efficiency reasons we will store the data in
a matrix with element node values running along
rows
........................
Elements
.
.
4Serial Optimization
5Serial Optimization
- Four approaches we will consider
- Algorithmic optimization
- Compiler based optimization
- Employ library routines
6Compiler Based Optimization
- In some ways the easiest thing to do
- Simple example
- cc O3 o foo foo.c
- Every compiler has its own set of options (like
-O3) - Tactic
- for machine specific/compiler specific options
- E.g. gt man gcc
- Examples
- -O, -O1,-O2,-O3,-fast
- -mcpupentium
- And so on
- BUILD THE OPTIONS INTO THE Makefile possibly
using a command line switch
7Library Based Optimization
- Use atlas or manufacturer (Intel mkl, IBM essl,
SGI complib, Sun Sunperf) based BLAS for
matrix-matrix multiplication - I highly recommend creating a wrapper around
DGEMM, as it has a very general interface and
wewill only require a small subset of functions - I gave you umLINALG.c which contains a number of
matrix-matrix routines neatly wrapped up.
8Wrapper for C AB
- Recall we are using BLAS routines which are
written in Fortran with Fortran storage in mind - If we have A stored in c then we really have A
stored in Fortran. - We use the following identity
- C BA
- So if we request BLAS to multiply c-stored B and
A together it will actually be multiplying B and
A together. - The result will be BA stored in Fortran
convention - However, BA stored in Fortran convention is
(BA) in c-convention and (BA) AB
9- / C A times B /
- void umAxB(double A, int rowsA, int colsA,
- double B, int rowsB, int
colsB, - double C)
- char opA N , opB N
- double one 1., zero 0
- int i,j,k
- if(colsA ! rowsB) fprintf(stderr, "Warning,
colsA!rowsB in umAxB\n") - ifdef UNDERSCORE
- dgemm_
- else
- dgemm
- endif
- (opA, opB, colsB, rowsA, colsA, one,
B, colsB, A, colsA, zero, C, colsB)
Notice we list B then A
10Wrapper for A(B)
- I included a routine umAxBtrans for this instance
too. - This is the most efficient way to multiply two
matrices since no cross row strides are
necessary. Here is some pseudo-code - for(i0iltNrowsAi)
- for(j0jltNrowsBj)
- double d 0
- for(k0kltNcolsAk)
- d d A i k B j k
-
- C i j d
-
Notice how we only ever accessthe matrix entries
in order of storage
11Note on Linking ATLAS
- Due to the way that the atlas libraries
(cblas,..) wherecompiled you should add the
following to your OBJS variable in the Makefile - xerbla.o
- Then copy xerbla.f from
- /home/cs471aa/umFEM/xerbla.f
- Into your build directory
12Individual Homework
- Write your own matrixmatrix and
matrixtranpose(matrix) routines which do not
call BLAS. Do not explicitly transpose the matrix
for the second version. - For N5,10,15,..,1000
- Create two random matrix of N rows and N columns
called A and B - Time how long it takes to multiply A and B into a
third matrix C one thousand times using your own
routine - Time how long it takes to multiply A and B
transpose into a third matrix C one thousand
times using your own routine - Repeat (2) using umAxB
- Repeat (3) using umAxBtrans
3) Plot the results from all the runs in part 2)
on the same graph with N on the horizontal
axis and average time taken on the vertical axis
using Matlab or similar. Label axes, include
a title and legend
13Algorithmic Optimization
- We have not paid too much attention to cache
locality in the current code. - We have accepted the order of the data as it is
read infrom file. i.e. we store triangle 1, then
triangle 2, - We are free to shuffle the order of the elements
(for example) - Why does this matter?.
14Banded Storage
- It is possible to perform LU factorization on a
band limited matrix using only the storage
required to holdthe non-zero diagonal bands. - i.e if a symmetric matrix only has 10 upper
diagonals with non-zero entries but 10000 rows
then we can save 10000-20 rows of storage. A big
saving!.
15Example
16Cache Miss Minimization
- When we go from an element local node to a
globally numbered node we may end up with a
random walk around memory. - This is not a good thing..
17(No Transcript)
18(No Transcript)
19(No Transcript)
20Reverse Cuthill McKee Algorithm
21(No Transcript)
220 Do
23(No Transcript)
24(No Transcript)
25(No Transcript)
26(No Transcript)
27(No Transcript)
28Lab
- Continue implementing group project
- If you are waiting for some other piece to be
readied work on the homework.