Title: Software Project: Fast Matrix Multiplication, Cache Usage, Make, Debugging
1 Software Project: Fast Matrix Multiplication, Cache Usage, Make, Debugging
2 Administration
- TA: Alex Shulman
- E-mail: shulmana_at_post.tau.ac.il
- Office Hours: Thursday 11:00-12:00
- Location: System Help Desk
- TA: Andrei Sharf
- E-mail: asharf_at_post.tau.ac.il
- Office Hours: Sunday 11:00-12:00
- Location: Schreiber 002
- Website:
- http://www.cs.tau.ac.il/shulmana/courses/soft-project/
3 Overview
- Home Exercise
- Fast Matrix Multiplication
- The simple algorithm
- Changing the loop order
- Blocking
- Supplementary Material
- Cache
- Timer
- Makefile
- Unix profiler (gprof)
- Debugger
4 Multiplication of 2D Matrices
- Time and Performance Measurement
- Simple Code Improvements
5 Matrix Multiplication
[Diagram: matrices A and B]
6 Matrix Multiplication
[Diagram: matrices A and B]
7 The simplest algorithm
Assumption: the matrices are stored as 2-D NxN arrays.
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];
Advantage: code simplicity. Disadvantage: performance.
8 First Improvement
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];
- c[i][j] is constant in the k-loop!
9 First Performance Improvement
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      int sum = 0;
      for (k = 0; k < N; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
10 Measuring the performance
- clock_t clock(void): returns the processor time used by the program since the beginning of execution, or -1 if unavailable.
- clock()/CLOCKS_PER_SEC is a time in seconds.
  #include <time.h>
  clock_t t1, t2;
  t1 = clock();
  mult_ijk(a, b, c, n);
  t2 = clock();
  printf("The running time is %lf seconds\n",
         (double)(t2 - t1) / CLOCKS_PER_SEC);
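Put together, this is a minimal compilable sketch of the timing pattern above. The fixed size N and the global arrays are placeholders for this sketch; the real exercise allocates matrices based on the input.

```c
#include <time.h>

#define N 64  /* hypothetical size; the exercise uses several sizes */

double a[N][N], b[N][N], c[N][N];

/* The ijk multiplication with the sum accumulator from the slide. */
void mult_ijk(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* Time one multiplication in seconds, using clock() as on the slide. */
double time_mult(void)
{
    clock_t t1 = clock();
    mult_ijk();
    clock_t t2 = clock();
    return (double)(t2 - t1) / CLOCKS_PER_SEC;
}
```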
11 The running time
[Chart: running times of the simplest algorithm vs. after the first optimization]
12 Profiling with gprof
- gprof is a profiling program which collects and arranges statistics on your programs.
- Profiling allows you to learn where your program spent its time.
- This information can show you which pieces of your program are slower and might be candidates for rewriting to make your program execute faster.
- It can also tell you which functions are being called more often than you expected.
13 The Unix gprof profiler
- To run the profiler:
- Add the flag -pg to compilation and linking
- Run your program (this writes the profile data to gmon.out)
- Run the profiler (gprof your_program gmon.out)
14 The Unix gprof profiler
15 Cache Memory
16 General Idea
- You are in the library gathering books for an assignment.
- The books you have gathered contain material that you will likely use.
- You do not collect ALL the books from the library to your desk.
- It is quicker to access information from the book on your desk than to go to the stacks again.
- This is like the use of cache principles in computing.
17 Cache Types
- There are many different types of caches associated with your computer system:
- browser cache (for the recent websites you've visited)
- memory caches
- hard drive caches
- A cache is meant to improve access times and enhance the overall performance of your computer.
- The type we're concerned with today is cache memory.
18 Memory Hierarchy
[Diagram: CPU <-> cache (word transfer) <-> main memory (block transfer) <-> disks]
- Moving down the hierarchy, away from the CPU:
- decreasing cost per bit
- decreasing frequency of access
- increasing capacity
- increasing access time
- increasing size of transfer unit
- The memory cache is closer to the processor than the main memory.
- It is smaller and faster than the main memory.
- Transfer between caches and main memory is performed in units called cache blocks/lines.
19 Types of Cache Misses
- 1. Compulsory misses: cache misses caused by the first access to a block that has never been in the cache (also known as cold-start misses).
- 2. Capacity misses: cache misses caused when the cache cannot contain all the blocks needed during execution of a program. They occur because of blocks being replaced and later retrieved when accessed.
- 3. Conflict misses: cache misses that occur when multiple blocks compete for the same set.
20 Main Cache Principles
- Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon.
- Keep more recently accessed data items closer to the processor.
- Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
- Move blocks consisting of contiguous words to the cache.
21 Improving Spatial Locality: Loop Reordering for Matrices Allocated by Row
[Diagram: allocation by rows]
22 Writing Cache Friendly Code
- Example:
- 4-byte words, 4-word cache blocks
  int sumarrayrows(int a[M][N])
  {
      int i, j, sum = 0;
      for (i = 0; i < M; i++)
          for (j = 0; j < N; j++)
              sum += a[i][j];
      return sum;
  }
Accesses successive elements.
  int sumarraycols(int a[M][N])
  {
      int i, j, sum = 0;
      for (j = 0; j < N; j++)
          for (i = 0; i < M; i++)
              sum += a[i][j];
      return sum;
  }
Accesses distant elements: no spatial locality!
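An annotated, compilable version of the two traversals (the sizes M and N here are arbitrary): both functions return the same sum; only the order in which memory is touched differs.

```c
#define M 512
#define N 512

int a[M][N];

/* Row-major traversal: successive accesses touch successive words,
   so each cache block is fully used before moving on. */
long sum_rows(void)
{
    long sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: successive accesses are N words apart,
   so every access may land in a different cache block. */
long sum_cols(void)
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```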
23 Matrix Multiplication (ijk)
  /* ijk */
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
      sum = 0;
      for (k = 0; k < n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
[Diagram: inner loop reads row (i,*) of A row-wise, column (*,j) of B, and the fixed element (i,j) of C]
- Misses per Inner Loop Iteration:
- A: 0.25, B: 1.0, C: 0.0
24 Matrix Multiplication (jik)
  /* jik */
  for (j = 0; j < n; j++)
    for (i = 0; i < n; i++) {
      sum = 0;
      for (k = 0; k < n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
[Diagram: inner loop reads row (i,*) of A row-wise, column (*,j) of B, and the fixed element (i,j) of C]
- Misses per Inner Loop Iteration:
- A: 0.25, B: 1.0, C: 0.0
25 Matrix Multiplication (kij)
  /* kij */
  for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
      r = a[i][k];
      for (j = 0; j < n; j++)
        c[i][j] += r * b[k][j];
    }
[Diagram: inner loop reads the fixed element (i,k) of A, row (k,*) of B row-wise, and row (i,*) of C row-wise]
- Misses per Inner Loop Iteration:
- A: 0.0, B: 0.25, C: 0.25
26 Matrix Multiplication (jki)
  /* jki */
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
      r = b[k][j];
      for (i = 0; i < n; i++)
        c[i][j] += a[i][k] * r;
    }
[Diagram: inner loop reads column (*,k) of A column-wise, the fixed element (k,j) of B, and column (*,j) of C column-wise]
- Misses per Inner Loop Iteration:
- A: 1.0, B: 0.0, C: 1.0
27 Summary
- ijk (and jik): misses/iter = 1.25
    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++) {
        sum = 0;
        for (k = 0; k < n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
- kij (and ikj): misses/iter = 0.5
    for (k = 0; k < n; k++)
      for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
          c[i][j] += r * b[k][j];
      }
- jki (and kji): misses/iter = 2.0
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
          c[i][j] += a[i][k] * r;
      }
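The three orderings compute exactly the same product, so they can be cross-checked against each other. A compilable sketch (the fixed size N is a placeholder; in kij and jki, C must be zeroed before the call, since they accumulate):

```c
#define N 32  /* hypothetical size for the sketch */

/* ijk: A row-wise, B column-wise; 1.25 misses/iter. */
void mult_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* kij: B and C row-wise; 0.5 misses/iter. c must start zeroed. */
void mult_kij(double a[N][N], double b[N][N], double c[N][N])
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double r = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}

/* jki: A and C column-wise; 2.0 misses/iter. c must start zeroed. */
void mult_jki(double a[N][N], double b[N][N], double c[N][N])
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) {
            double r = b[k][j];
            for (int i = 0; i < N; i++)
                c[i][j] += a[i][k] * r;
        }
}
```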
28 Cache Misses Analysis
- Assumptions:
- The cache can hold M words of memory in blocks of size B.
- The replacement policy is LRU (Least Recently Used).
- Compulsory Misses:
- When the computation begins, the elements of all the matrices will be brought into the cache.
- Num. of cache misses: 3n²/B (each of the 3 n×n matrices is loaded B elements per block).
29 Cache Misses Analysis
The matrix B is scanned n times, and the overall miss count follows from the misses per iteration (since the scans are all in the same order).
[Diagram: inner loop reads row (i,*) of A row-wise, column (*,j) of B, and the fixed element (i,j) of C]
The lower bound for the 3 matrices is attained by the kij order, in which A and C, as well as B, are accessed row-wise.
30 Improving Temporal Locality: Blocked Matrix Multiplication
31 Blocked Matrix Multiplication
[Diagram: row i of A, column j of B, element (i,j) of C, with a cache block highlighted]
Key idea: reuse the other elements in each cache block as much as possible.
32 Blocked Matrix Multiplication
[Diagram: b-element pieces of row i of A and of columns j, j+1 of B, producing c[i][j] and c[i][j+1]]
- Since one loads column j+1 of B into the cache lines anyway, compute c[i][j+1] as well.
- Reorder the operations:
- compute the first b terms of c[i][j], compute the first b terms of c[i][j+1]
- compute the next b terms of c[i][j], compute the next b terms of c[i][j+1]
- .....
33 Blocked Matrix Multiplication
[Diagram: a subrow of C computed from row i of A and all columns of B, with a cache block highlighted]
Compute a whole subrow of C, with the same reordering of the operations. But then one has to load all columns of B, which one has to do again for computing the next row of C. Idea: reuse the blocks of B that we have just loaded.
34 Blocked Matrix Multiplication
[Diagram: a block of C computed from a block row of A and a block column of B]
Order of the operations: compute the first b terms of all c[i][j] values in the C block, compute the next b terms of all c[i][j] values in the C block, . . . , compute the last b terms of all c[i][j] values in the C block.
35 Blocked Matrix Multiplication
36 Blocked Matrix Multiplication
  C11 C12 C13 C14     A11 A12 A13 A14     B11 B12 B13 B14
  C21 C22 C23 C24  =  A21 A22 A23 A24  x  B21 B22 B23 B24
  C31 C32 C33 C34     A31 A32 A33 A34     B31 B32 B33 B34
  C41 C42 C43 C44     A41 A42 A43 A44     B41 B42 B43 B44
N = 4b
- C22 = A21*B12 + A22*B22 + A23*B32 + A24*B42
- 4 matrix multiplications
- 4 matrix additions
- Main Point: each multiplication operates on small block matrices, whose size may be chosen so that they fit in the cache.
37 Blocked Algorithm
- The blocked version of the i-j-k algorithm is written simply as:
  for (i = 0; i < N/B; i++)
    for (j = 0; j < N/B; j++)
      for (k = 0; k < N/B; k++)
        C[i][j] += A[i][k] * B[k][j];
- where B is the block size (which we assume divides N)
- where X[i][j] is the block of matrix X on block row i and block column j
- where += means matrix addition
- where * means matrix multiplication
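Expanded to scalar loops, the blocked algorithm can be sketched as below, next to a naive reference it must agree with. The sizes N and BS are placeholders for the sketch (BS divides N, as the slide assumes); the block triple touched by the three inner loops is small enough to stay in the cache.

```c
#define N 64
#define BS 16  /* block size; BS divides N, as the slide assumes */

/* Plain ijk multiply, used as the reference. */
void mult_naive(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* Blocked i-j-k multiply: the outer loops walk BSxBS blocks; the inner
   loops accumulate C[i][j] += A[i][k] * B[k][j] within one block triple.
   c must be zeroed before the call. */
void mult_blocked(double a[N][N], double b[N][N], double c[N][N])
{
    for (int bi = 0; bi < N; bi += BS)
        for (int bj = 0; bj < N; bj += BS)
            for (int bk = 0; bk < N; bk += BS)
                for (int i = bi; i < bi + BS; i++)
                    for (int j = bj; j < bj + BS; j++) {
                        double sum = c[i][j];
                        for (int k = bk; k < bk + BS; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
}
```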
38 Maximum Block Size
- The blocking optimization only works if the blocks fit in cache.
- That is, 3 blocks of size b x b must fit in memory (for A, B, and C).
- Let M be the cache size (in elements).
- We must have 3b^2 <= M, or b <= sqrt(M/3).
- Therefore, in the best case, the ratio of the number of operations to slow memory accesses is sqrt(M/3).
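The bound b <= sqrt(M/3) can be computed with integer arithmetic alone; a small helper (the function name is ours, not part of the exercise):

```c
/* Largest b such that three b-by-b blocks (for A, B, and C) fit in a
   cache of m elements, i.e. the largest b with 3*b*b <= m. */
int max_block_size(int m)
{
    int b = 1;
    while (3 * (b + 1) * (b + 1) <= m)
        b++;
    return b;
}
```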
39 Home Exercise
40 Home exercise
- Implement the described algorithms for matrix multiplication and measure the performance.
- Store the matrices as arrays, organized by columns!!!
41 Question 1.1: mlpl
- Implement all 6 options of loop ordering (ijk, ikj, jik, jki, kij, kji).
- Run them for matrices of different sizes.
- Measure the performance with clock() and gprof.
- Select the most efficient loop ordering.
- Plot the running times of all the options (ijk, jki, etc.) as a function of matrix size.
42 Question 1.2: block_mlpl
- Implement the blocking algorithm.
- Run it for matrices of different sizes.
- Measure the performance with clock() and gprof.
- Use the most efficient loop ordering from 1.1.
- Plot the running times in CPU ticks as a function of matrix size.
43 User Interface
- Input:
- Case 1: 0 or a negative number
- Case 2: a positive integer followed by the matrix values
- Output:
- Case 1: the running times
- Case 2: a matrix, which is the square of the input one
44 Files and locations
- All of your files should be located under your home directory: /soft-project/assign1/mlpl
- The source files and the executable should match the exercise name (e.g. mlpl, mlpl.c).
- Strictly follow the provided prototypes and the file framework.
45 Emacs and Compilation
- Command: gcc hello.c -o hello
- Executable: hello
- Recommended: gcc -Wall hello.c -o hello
- Tip: use F9 to compile from Emacs
46 The Makefile
  mlpl: allocate_free.c matrix_manipulate.c multiply.c mlpl.c
  	gcc -Wall -g -pg allocate_free.c matrix_manipulate.c multiply.c mlpl.c -o mlpl
  block_mlpl: allocate_free.c matrix_manipulate.c multiply.c block_mlpl.c
  	gcc -Wall -g -pg allocate_free.c matrix_manipulate.c multiply.c block_mlpl.c -o block_mlpl
Commands: "make mlpl" will create the executable mlpl for 1.1; "make block_mlpl" will create the executable block_mlpl for 1.2.
47 Plotting the graphs
- Save the output to a .csv file.
- Open it in Excel.
- Use Excel's Chart Wizard to plot the data as an XY Scatter.
- X-axis: the matrix sizes.
- Y-axis: the running time in CPU ticks.
48 DDD Debugger
49 DDD Debugger Notes
- Compile the code with the -g flag:
  > gcc -Wall -g casting.c -o casting
- Run DDD:
  > ddd casting
50 Final Notes
- Arrays and Pointers
- As function parameters, the declarations below are equivalent:
- int a[]
- int *a
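A small sketch of this equivalence (the function names are ours, for illustration only): in a parameter list both forms declare a pointer to int, and indexing is the same as pointer arithmetic.

```c
/* int a[] and int *a declare the same parameter type: pointer to int. */
int first_array(int a[]) { return a[0]; }
int first_ptr(int *a)    { return *a; }

/* a[i] and *(a + i) are interchangeable. */
int ith(int *a, int i)   { return *(a + i); }
```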
51 Good Luck in the Exercise!!!