MA471 Fall 2002, Lecture 8

About This Presentation

Title:

MA471 Fall 2002, Lecture 8

Description:

For efficiency reasons we will store the data in a matrix ... Tactic: for machine specific/compiler specific options; E.g. man gcc. Examples: ... – PowerPoint PPT presentation

Number of Views:19

Avg rating:3.0/5.0

Slides: 29

Provided by: marlenem9

Learn more at: https://www.caam.rice.edu

Category:

more less

Transcript and Presenter's Notes

Title: MA471 Fall 2002, Lecture 8

1
MA471Fall 2002, Lecture 8

Some efficiency related tricks

2
Data Point Storage
In Matlab and Fortran
For efficiency reasons we will store the data in
a matrixwith element nodes running along matrix
columns
Elements
1
2
Data Values
3
................................
4
.
.
.
.
.
.
.
.
.
3
Data Point Storage
Data Values

1
2
3

In C or C
.
.
For efficiency reasons we will store the data in
a matrix with element node values running along
rows
........................
Elements

.
.
4
Serial Optimization
5
Serial Optimization

Four approaches we will consider
Algorithmic optimization
Compiler based optimization
Employ library routines

6
Compiler Based Optimization

In some ways the easiest thing to do
Simple example
cc O3 o foo foo.c
Every compiler has its own set of options (like
-O3)
Tactic
for machine specific/compiler specific options
E.g. gt man gcc
Examples
-O, -O1,-O2,-O3,-fast
-mcpupentium
And so on
BUILD THE OPTIONS INTO THE Makefile possibly
using a command line switch

7
Library Based Optimization

Use atlas or manufacturer (Intel mkl, IBM essl,
SGI complib, Sun Sunperf) based BLAS for
matrix-matrix multiplication
I highly recommend creating a wrapper around
DGEMM, as it has a very general interface and
wewill only require a small subset of functions
I gave you umLINALG.c which contains a number of
matrix-matrix routines neatly wrapped up.

8
Wrapper for C AB

Recall we are using BLAS routines which are
written in Fortran with Fortran storage in mind
If we have A stored in c then we really have A
stored in Fortran.
We use the following identity
C BA
So if we request BLAS to multiply c-stored B and
A together it will actually be multiplying B and
A together.
The result will be BA stored in Fortran
convention
However, BA stored in Fortran convention is
(BA) in c-convention and (BA) AB

/ C A times B /
void umAxB(double A, int rowsA, int colsA,
double B, int rowsB, int
colsB,
double C)
char opA N , opB N
double one 1., zero 0
int i,j,k
if(colsA ! rowsB) fprintf(stderr, "Warning,
colsA!rowsB in umAxB\n")
ifdef UNDERSCORE
dgemm_
else
dgemm
endif
(opA, opB, colsB, rowsA, colsA, one,
B, colsB, A, colsA, zero, C, colsB)

Notice we list B then A
10
Wrapper for A(B)

I included a routine umAxBtrans for this instance
too.
This is the most efficient way to multiply two
matrices since no cross row strides are
necessary. Here is some pseudo-code
for(i0iltNrowsAi)
for(j0jltNrowsBj)
double d 0
for(k0kltNcolsAk)
d d A i k B j k
C i j d

Notice how we only ever accessthe matrix entries
in order of storage
11
Note on Linking ATLAS

Due to the way that the atlas libraries
(cblas,..) wherecompiled you should add the
following to your OBJS variable in the Makefile
xerbla.o
Then copy xerbla.f from
/home/cs471aa/umFEM/xerbla.f
Into your build directory

12
Individual Homework

Write your own matrixmatrix and
matrixtranpose(matrix) routines which do not
call BLAS. Do not explicitly transpose the matrix
for the second version.
For N5,10,15,..,1000
Create two random matrix of N rows and N columns
called A and B
Time how long it takes to multiply A and B into a
third matrix C one thousand times using your own
routine
Time how long it takes to multiply A and B
transpose into a third matrix C one thousand
times using your own routine
Repeat (2) using umAxB
Repeat (3) using umAxBtrans

3) Plot the results from all the runs in part 2)
on the same graph with N on the horizontal
axis and average time taken on the vertical axis
using Matlab or similar. Label axes, include
a title and legend
13
Algorithmic Optimization

We have not paid too much attention to cache
locality in the current code.
We have accepted the order of the data as it is
read infrom file. i.e. we store triangle 1, then
triangle 2,
We are free to shuffle the order of the elements
(for example)
Why does this matter?.

14
Banded Storage

It is possible to perform LU factorization on a
band limited matrix using only the storage
required to holdthe non-zero diagonal bands.
i.e if a symmetric matrix only has 10 upper
diagonals with non-zero entries but 10000 rows
then we can save 10000-20 rows of storage. A big
saving!.

15
Example
16
Cache Miss Minimization

When we go from an element local node to a
globally numbered node we may end up with a
random walk around memory.
This is not a good thing..

17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
Reverse Cuthill McKee Algorithm
21
(No Transcript)
22
0 Do
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
Lab