Parallelism and Locality in Matrix Computations www.cs.berkeley.edu/~demmel/cs267_Spr09 Dense Linear Algebra: Optimizing Sequential Matrix Multiply

About This Presentation

Title:

Parallelism and Locality in Matrix Computations www.cs.berkeley.edu/~demmel/cs267_Spr09 Dense Linear Algebra: Optimizing Sequential Matrix Multiply

Description:

demmel_at_cs.berkeley.edu * CS267 Lecture 2 * Paper: ... Sun Ultra2 Model 2200. SGI R8000ip21. Vendor BLAS. Reference BLAS. DGEMM. DSYMM. DSYR2K. DSYRK. DTRMM. DTRSM. Vendor. HP9000/735/125. – PowerPoint PPT presentation

Number of Views:379

Avg rating:3.0/5.0

Slides: 91

Provided by: microsof83

Category:

more less

Transcript and Presenter's Notes

Title: Parallelism and Locality in Matrix Computations www.cs.berkeley.edu/~demmel/cs267_Spr09 Dense Linear Algebra: Optimizing Sequential Matrix Multiply

1
Parallelism and Locality in Matrix
Computationswww.cs.berkeley.edu/demmel/cs267_Sp
r09Dense Linear AlgebraOptimizing Sequential
Matrix Multiply

James Demmel
EECS Math Departments, UC Berkeley
demmel_at_cs.berkeley.edu

2
Outline of Dense Linear Algebra

A brief history of linear algebra software
Optimizing Sequential Matrix Multiplication
(MatMul)
Optimizing Parallel MatMul
Optimizing Gaussian Elimination
Lower bounds on communication (moving data)

3
A brief history of (Dense) Linear Algebra
software (1/6)

In the beginning was the do-loop

Libraries like EISPACK (for eigenvalue problems)
Then the BLAS (1) were invented (1973-1977)
Standard library of 15 operations (mostly) on
vectors
AXPY ( y ax y ), dot product, scale (x
ax ), etc
Up to 4 versions of each (S/D/C/Z), 46 routines,
3300 LOC
Goals
Common pattern to ease programming,
readability, self-documentation
Robustness, via careful coding (avoiding
over/underflow)
Portability Efficiency via machine specific
implementations
Why BLAS 1 ? They do O(n1) ops on O(n1) data
Used in libraries like LINPACK (for linear
systems)
Source of the name LINPACK Benchmark (not the
code!)

4
A brief history of (Dense) Linear Algebra
software (2/6)

But the BLAS-1 werent enough
Consider AXPY ( y ax y ) 2n flops on 3n
read/writes
Computational intensity flops / mem_refs
(2n)/(3n) 2/3
Too low to run near peak speed (time for mem_refs
dominates)
Hard to vectorize (SIMDize) on supercomputers
of the day (1980s)
So the BLAS-2 were invented (1984-1986)
Standard library of 25 operations (mostly) on
matrix/vector pairs
GEMV y aAx ßx, GER A A axyT,
TRSV y T-1x
Up to 4 versions of each (S/D/C/Z), 66 routines,
18K LOC
Why BLAS 2 ? They do O(n2) ops on O(n2) data
So computational intensity still just (2n2)/(n2)
2
OK for vector machines, but not for machine with
caches

5
A brief history of (Dense) Linear Algebra
software (3/6)

The next step BLAS-3 (1987-1988)
Standard library of 9 operations (mostly) on
matrix/matrix pairs
GEMM C aAB ßC, SYRK C aAAT
ßC, TRSM C T-1B
Up to 4 versions of each (S/D/C/Z), 30 routines,
10K LOC
Why BLAS 3 ? They do O(n3) ops on O(n2) data
So computational intensity (2n3)/(4n2) n/2
big at last!
Tuning opportunities machines with caches, other
mem. hierarchy levels
How much BLAS1/2/3 code so far (all at
www.netlib.org/blas)
Source 142 routines, 31K LOC, Testing 28K
LOC
Reference (unoptimized) implementation only
Ex 3 nested loops for GEMM
Lots more optimized code
Most computer vendors provide own optimized
versions
Motivates automatic tuning of the BLAS (details
later)

6
A brief history of (Dense) Linear Algebra
software (4/6)

LAPACK Linear Algebra PACKage - uses BLAS-3
(1989 now)
Ex Obvious way to express Gaussian Elimination
(GE) is adding multiples of one row to other rows
BLAS-1
How do we reorganize GE to use BLAS-3 ? (details
later)
Contents of LAPACK (summary)
Algorithms we can turn into (nearly) 100 BLAS 3
Linear Systems solve Axb for x
Least Squares choose x to minimize r2 ? ?S
ri2 where rAx-b
Algorithms we can only make up to 50 BLAS 3
(so far)
Eigenproblems Find l and x where Ax l x
Singular Value Decomposition (SVD) ATAx?2x
Error bounds for everything
Lots of variants depending on As structure
(banded, AAT, etc)
How much code? (Release 3.2, Nov 2008)
(www.netlib.org/lapack)
Source 1582 routines, 490K LOC, Testing 352K
LOC
Ongoing development (at UCB, UTK and elsewhere)

7
What could go into a linear algebra library?
For all linear algebra problems
For all matrix/problem structures
For all data types
For all architectures and networks
For all programming interfaces
Produce best algorithm(s) w.r.t.
performance and accuracy (including condition
estimates, etc)
Need to prioritize, automate!
8
A brief history of (Dense) Linear Algebra
software (5/6)

Is LAPACK parallel?
Only if the BLAS are parallel (possible in shared
memory)
ScaLAPACK Scalable LAPACK (1995 now)
For distributed memory uses MPI
More complex data structures, algorithms than
LAPACK
Only (small) subset of LAPACKs functionality
available
Details later (class projects!)
All at www.netlib.org/scalapack

9
Success Stories for Sca/LAPACK

Widely used
Adopted by Mathworks, Cray, Fujitsu, HP, IBM,
IMSL, Intel, NAG, NEC, SGI,
gt100M web hits(in 2009, 56M in 2006) _at_ Netlib
(incl. CLAPACK, LAPACK95)
New Science discovered through the solution of
dense matrix systems
Nature article on the flat universe used
ScaLAPACK
Other articles in Physics Review B that also use
it
1998 Gordon Bell Prize
www.nersc.gov/news/reports/newNERSCresults050703.p
df

Cosmic Microwave Background Analysis, BOOMERanG
collaboration, MADCAP code (Apr. 27, 2000).
ScaLAPACK
10
A brief history/future of (Dense) Linear Algebra
software (6/6)

Versions for GPUs some algorithms change
significantly
PLASMA extensions to multicore
Communication-minimizing Linear Algebra
Lower bounds on data movement for all direct
linear algebra
LU, QR, eig, svd, dense sparse
sequential parallel
New algorithms that attain these lower bounds
Can one software infrastructure accommodate all
algorithms and platforms of current (future)
interest?
How much code generation and tuning can we
automate?
MAGMA planned PLASMA extension to
multicore/GPU/heterogeneous
FLAME (www.cs.utexas.edu/users/flame/)
Related projects
BLAST Forum (www.netlib.org/blas/blast-forum)
BLAS extensions
Other languages, add some new functions, sparse
matrices, extra-precision
Extra precision BLAS in latest LAPACK release

11
Organizing Linear Algebra in books
www.netlib.org/lapack
www.netlib.org/scalapack
gams.nist.gov
www.cs.utk.edu/dongarra/etemplates
www.netlib.org/templates
12
Outline of Dense Linear Algebra

A brief history of linear algebra software
Optimizing Sequential Matrix Multiplication
(MatMul)
Simple performance models to understand
performance
Warm-up Matrix-vector multiplication
Reducing memory traffic for MatMul
Lower bound on memory traffic for MatMul
Other variants on MatMul
Automatic vs Hand-Tuning of Algorithms
Optimizing Parallel MatMul
Optimizing Gaussian Elimination
Lower bounds on communication (moving data)

13
Why Matrix Multiplication?

An important kernel in many problems
Used in many linear algebra algorithms
Closely related to other algorithms, e.g.,
transitive closure on a graph using
Floyd-Warshall
Optimization ideas can be used in other problems
The best case for optimization payoffs
The most-studied algorithm in high performance
computing

14
Matrix-multiply, optimized several ways
Speed of n-by-n matrix multiply on Sun
Ultra-1/170, peak 330 MFlops
15
Note on Matrix Storage

A matrix is (usually) a 2-D array of elements,
but memory addresses are 1-D
Conventions for matrix layout
by column, or column major (Fortran default)
A(i,j) at Aijn
by row, or row major (C default) A(i,j) at
Ainj
recursive (later)
Column major (for now)

Column major matrix in memory
Row major
Column major
0
5
10
15
0
1
2
3
1
6
11
16
4
5
6
7
2
7
12
17
8
9
10
11
3
8
13
18
12
13
14
15
4
9
14
19
16
17
18
19
Blue row of matrix is stored in red cachelines
cachelines
Figure source Larry Carter, UCSD
16
Using a Simple Model of Memory to Optimize

Assume just 2 levels in memory hierarchy, fast
and slow
All data initially in slow memory
m number of memory elements (words) moved
between fast and slow memory
tm time per slow memory operation
f number of arithmetic operations
tf time per arithmetic operation ltlt tm
q f / m average number of flops per slow
memory access
Minimum possible time f tf when all data in
fast memory
Actual time
f tf m tm f tf (1 tm/tf 1/q)
Larger q means time closer to minimum f tf
q ? tm/tf needed to get at least half of peak
speed

17
Warm up Matrix-vector multiplication

implements y y Ax
for i 1n
for j 1n
y(i) y(i) A(i,j)x(j)

A(i,)

y(i)
y(i)
x()
18
Warm up Matrix-vector multiplication

read x(1n) into fast memory
read y(1n) into fast memory
for i 1n
read row i of A into fast memory
for j 1n
y(i) y(i) A(i,j)x(j)
write y(1n) back to slow memory

m number of slow memory refs 3n n2
f number of arithmetic operations 2n2
q f / m ? 2
Matrix-vector multiplication limited by slow
memory speed

19
Modeling Matrix-Vector Multiplication

Compute time for nxn 1000x1000 matrix
Time
f tf m tm f tf (1 tm/tf 1/q)
2n2 tf (1
tm/tf 1/2)
For tf and tm, using data from R. Vuducs PhD (pp
351-3)
http//bebop.cs.berkeley.edu/pubs/vuduc2003-disser
tation.pdf
For tm use minimum-memory-latency /
words-per-cache-line

machine balance (q must be at least this for ½
peak speed, but q2
20
Validating the Model

How well does the model predict actual
performance?
Actual DGEMV Most highly optimized code for the
platform
Model sufficient to compare across machines
But under-predicting on most recent ones due to
latency estimate

21
Naïve Matrix Multiply

implements C C AB
for i 1 to n
for j 1 to n
for k 1 to n
C(i,j) C(i,j) A(i,k) B(k,j)

Algorithm has 2n3 O(n3) Flops and operates on
3n2 words of memory q potentially as large as
2n3 / 3n2 O(n)
A(i,)
C(i,j)
C(i,j)
B(,j)

22
Naïve Matrix Multiply

implements C C AB
for i 1 to n
read row i of A into fast memory
for j 1 to n
read C(i,j) into fast memory
read column j of B into fast memory
for k 1 to n
C(i,j) C(i,j) A(i,k) B(k,j)
write C(i,j) back to slow memory

A(i,)
C(i,j)
C(i,j)
B(,j)

23
Naïve Matrix Multiply

Number of slow memory references on unblocked
matrix multiply
m n3 to read each column of B n times
n2 to read each row of A once
2n2 to read and write each element
of C once
n3 3n2
So q f / m 2n3 / (n3 3n2)
? 2 for large n, no improvement over
matrix-vector multiply
Inner two loops are just matrix-vector multiply,
of row i of A times B
Similar for any other order of 3 loops

A(i,)
C(i,j)
C(i,j)
B(,j)

24
Matrix-multiply, optimized several ways
Speed of n-by-n matrix multiply on Sun
Ultra-1/170, peak 330 MFlops
25
Naïve Matrix Multiply on IBM RS/6000
12000 would take 1095 years
T N4.7
Size 2000 took 5 days
O(N3) performance would have constant
cycles/flop Performance looks like O(N4.7)
Slide source Larry Carter, UCSD
26
Naïve Matrix Multiply on RS/6000
Page miss every iteration
TLB miss every iteration
Cache miss every 16 iterations
Page miss every 512 iterations
Slide source Larry Carter, UCSD
27
From Naïve to Blocked (or Tiled) Matrix Multiply

C(i,j) C(i,j) A(i,1)B(1,j) A(i,2)B(2,j)
C(i,j) A(i,1n)B(1n,j)
True if C(i,j) is a scalar (1x1 submatrix)
A(i,1n) is a row of A and B(1n,j) is a column
of B
Operation is a dot product
True if C(i,j) is a b x b submatrix
A(i,1n) is a b x n submatrix of A and B(1n,j)
is an n x b submatrix of B
C(i,j) C(i,j) A(i,1)B(1,j) A(i,2)B(2,j)
A(i,N)B(N,j) where
All factors are b x b subblocks
N n/b number of b x b blocks

28
Blocked (Tiled) Matrix Multiply

Consider A,B,C to be N-by-N matrices of b-by-b
subblocks where bn / N is called the
block size
for i 1 to N
for j 1 to N
read block C(i,j) into fast memory
for k 1 to N
read block A(i,k) into fast
memory
read block B(k,j) into fast
memory
C(i,j) C(i,j) A(i,k)
B(k,j) do a matrix multiply on blocks
write block C(i,j) back to slow memory

A(i,k)
C(i,j)
C(i,j)

B(k,j)
29
Blocked (Tiled) Matrix Multiply

Recall
m is amount memory traffic between slow and
fast memory
matrix has nxn elements, and NxN blocks each
of size bxb
f is number of floating point operations, 2n3
for this problem
q f / m is our measure of algorithm
efficiency in the memory system
So

m Nn2 read each block of B N3 times (N3
b2 N3 (n/N)2 Nn2) Nn2 read
each block of A N3 times 2n2 read
and write each block of C once (2N
2) n2 So computational intensity q f / m
2n3 / ((2N 2) n2)
? n / N b for large n So we
can improve performance by increasing the
blocksize b Can be much faster than
matrix-vector multiply (q2)
30
How large can we make computional intensity of
matmul?

The blocked algorithm has computational intensity
q ? b
The larger the block size, the more efficient our
algorithm will be
Limit All three blocks from A,B,C must fit in
fast memory (cache), so we cannot make these
blocks arbitrarily large
Assume your fast memory has size Mfast
3b2 ? Mfast, so q ? b ?
(Mfast/3)1/2

To build a machine to run matrix multiply at 1/2
peak arithmetic speed of the machine, we need a
fast memory of size
Mfast ? 3b2 ? 3q2 3(tm/tf)2
This size is reasonable for L1 cache, but not for
register sets
Note analysis assumes it is possible to schedule
the instructions perfectly

31
Limits to Optimizing Matrix Multiply

The blocked algorithm changes the order in which
values are accumulated into each Ci,j by
applying commutativity and associativity
Get slightly different answers from naïve code,
because of roundoff - OK
We just showed that the blocked algorithm has
computational intensity
q ? b ? (Mfast/3)1/2
Theorem (Hong Kung, 1981) Any reorganization
of this algorithm (that uses only
commutativity/associativity) is limited to q O(
(Mfast)1/2 )
Said another way,

words moved between fast/slow memory ?( n3 /
Mfast1/2 )
Proof of more general result later
Can get a lower bound on the latency cost (
messages sent) by dividing above lower bound by
maximum message size Mfast
messages for sequential MatMul ?( n3 / Mfast
3/2 )
Ex disk accesses

32
BLAS speeds on an IBM RS6000/590
Peak speed 266 Mflops
Peak
BLAS 3
BLAS 2
BLAS 1
BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2
(n-by-n matrix vector multiply) vs BLAS 1 (saxpy
of n vectors)
33
Dense Linear Algebra BLAS2 vs. BLAS3

BLAS2 and BLAS3 have very different computational
intensity, and therefore different performance

Data source Jack Dongarra
34
Recursion Cache Oblivious Algorithms

The tiled algorithm requires finding a good block
size
Machine dependent
What if there are multiple levels of cache? Need
to block b x b matrix multiply in inner most
loop
1 level of memory ? 3 nested loops (naïve
algorithm)
2 levels of memory ? 6 nested loops
3 levels of memory ? 9 nested loops
Cache Oblivious Algorithms offer an alternative
Treat nxn matrix multiply as a set of smaller
problems
Eventually, these will fit in cache
Will minimize words moved between every level
of memory hierarchy (between L1 and L2 cache, L2
and L3, L3 and main memory etc.) at least
asymptotically

35
Recursive Matrix Multiplication (RMM) (1/2)

For simplicity square matrices with n 2m
C A B
True when each Aij etc 1x1 or n/2 x n/2

func C RMM (A, B, n) if n 1, C A
B, else C11 RMM (A11 , B11 , n/2)
RMM (A12 , B21 , n/2) C12 RMM (A11
, B12 , n/2) RMM (A12 , B22 , n/2)
C21 RMM (A21 , B11 , n/2) RMM (A22 , B21 ,
n/2) C22 RMM (A21 , B12 , n/2)
RMM (A22 , B22 , n/2) return
36
Recursive Matrix Multiplication (2/2)
func C RMM (A, B, n) if n1, C A B,
else C11 RMM (A11 , B11 , n/2)
RMM (A12 , B21 , n/2) C12 RMM (A11
, B12 , n/2) RMM (A12 , B22 , n/2)
C21 RMM (A21 , B11 , n/2) RMM (A22 , B21 ,
n/2) C22 RMM (A21 , B12 , n/2)
RMM (A22 , B22 , n/2) return
A(n) arithmetic operations in RMM( . , . ,
n) 8 A(n/2) 4(n/2)2 if n gt 1,
else 1 2n3 same operations as
usual, in different order M(n) words moved
between fast, slow memory by RMM( . , . , n)
8 M(n/2) 4(n/2)2 if 3n2 gt Mfast ,
else 3n2 O( n3 / (Mfast )1/2 n2 )
same as blocked matmul
37
Recursion Cache Oblivious Algorithms

Recursion for general A (nxm) B (mxp)

Case1 mgt maxn,p split A horizontally
Case 2 ngt maxm,p split A vertically and B
horizontally
Case 3 pgt maxm,n split B vertically
Attains lower bound in O() sense

Case 1
Case 3
38
Experience with Cache-Oblivious Algorithms

In practice, need to cut off recursion well
before 1x1 blocks
Call Micro-kernel for small blocks, eg 16 x 16
Implementing a high-performance Cache-Oblivious
code is not easy
Using fully recursive approach with highly
optimized recursive micro-kernel, Pingali et al
report that they never got more than 2/3 of peak.
Issues with Cache Oblivious (recursive) approach
Recursive Micro-Kernels yield less performance
than iterative ones using same scheduling
techniques
Pre-fetching is needed to compete with best code
not well-understood in the context of Cache
Oblivous codes

Unpublished work, presented at LACSI 2006
39
Recursive Data Layouts

A related idea is to use a recursive structure
for the matrix
Goal any square submatrix (suitably aligned)
should lie in contiguous memory
Improves locality with machine-independent data
structure
Can minimize latency with multiple levels of
memory hierarchy
This figure shows Z-Morton Ordering (space
filling curve)
Other orderings possible, eg C-Morton
See papers on cache oblivious algorithms and
recursive layouts
Gustavson, Kagstrom, et al, SIAM Review, 2004

Advantages
the recursive layout works well for any cache
size
Disadvantages
The index calculations to find Ai,j are
expensive
Implementations switch to column-major for small
sizes

40
Strassens Matrix Multiply

The traditional algorithm (with or without
tiling) has O(n3) flops
Strassen discovered an algorithm with
asymptotically lower flops
O(n2.81)
Consider a 2x2 matrix multiply, normally takes 8
multiplies, 4 adds
Strassen does it with 7 multiplies and 18 adds

Let M m11 m12 a11 a12 b11 b12
m21 m22 a21 a22 b21 b22 Let p1
(a12 - a22) (b21 b22)
p5 a11 (b12 - b22) p2 (a11
a22) (b11 b22)
p6 a22 (b21 - b11) p3 (a11 - a21)
(b11 b12) p7
(a21 a22) b11 p4 (a11 a12)
b22 Then m11 p1 p2 - p4 p6 m12
p4 p5 m21 p6 p7 m22
p2 - p3 p5 - p7
Extends to nxn by divideconquer
41
Strassen Matrix Multiplication (2/2)
func C StrMM (A, B, n) if n1 (or small
enough), C A B, else P1 StrMM
(A12 - A22 , B21 B22 , n/2) P2
StrMM (A11 A22 , B11 B22 , n/2)
P3 StrMM (A11 - A21 , B11 B12 , n/2)
P4 StrMM (A11 A12 , B22 , n/2)
P5 StrMM (A11 , B12 - B22 , n/2) P6
StrMM (A22 , B21 B11 , n/2) P7
StrMM (A21 A22 , B11 , n/2) C11
P1 P2 P4 P6, C12 P4 P5
C22 P2 - P3 P5 P7, C21 P6 P7
return
42
Strassen (continued)

Asymptotically faster
Several times faster for large n in practice
Cross-over depends on machine
Available in some libraries (but see below)
Tuning Strassen's Matrix Multiplication for
Memory Efficiency, M. S. Thottethodi, S.
Chatterjee, and A. Lebeck, in Proceedings of
Supercomputing '98
Caveats
Needs more memory than standard algorithm
Can be a little less accurate because of roundoff
error
Forbidden by rules of Top500 list (so not widely
used)

43
Other Fast Matrix Multiplication Algorithms

Current worlds record is O(n 2.376... )
Coppersmith Winograd, 1987
Why does Hong/Kung theorem not apply?
Extension is open problem
Possibility of O(n2?) algorithm!
Cohn, Umans, Kleinberg, 2003
Can show they all can be made numerically stable
D., Dumitriu, Holtz, Kleinberg, 2007
Can do rest of linear algebra (solve Axb, least
squares, Ax?x, etc) as fast , and numerically
stably
D., Dumitriu, Holtz, 2008
Fast methods (besides Strassen) may need
unrealistically large n
But can reuse ideas to minimize communication

44
Tuning Code in Practice

Tuning code can be tedious
Lots of code variations to try besides blocking
details later
Machine hardware performance hard to predict
Compiler behavior hard to predict
Response Autotuning
Let computer generate large set of possible code
variations, and search them for the fastest ones
Field started with CS267 homework assignment in
mid 1990s
PHiPAC, leading to ATLAS, incorporated in Matlab
We still use the same assignment
We (and others) are extending autotuning to other
dwarfs / motifs
Still need to understand how to do it by hand
Not every code will have an autotuner
Need to know if you want to build autotuners

45
Search Over Block Sizes

Performance models are useful for high level
algorithms
Helps in developing a blocked algorithm
Models have not proven very useful for block size
selection
too complicated to be useful
See work by Sid Chatterjee for detailed model
too simple to be accurate
Multiple multidimensional arrays, virtual memory,
etc.
Speed depends on matrix dimensions, details of
code, compiler, processor

46
What the Search Space Can Look Like
Number of columns in register block
Number of rows in register block
A 2-D slice of a 3-D register-tile search space.
The dark blue region was pruned. (Platform Sun
Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0
compiler)
47
Tiling Alone Might Not Be Enough

Naïve and a naïvely tiled code on Itanium 2
Searched all block sizes to find best, b56
Starting point for next homework

48
Example Select a Matmul Implementation
49
Example Support Vector Classification
50
ATLAS (DGEMM n 500)
Source Jack Dongarra

ATLAS is faster than all other portable BLAS
implementations and it is comparable with
machine-specific libraries provided by the vendor.

51
Optimizing in Practice

Tiling for registers
loop unrolling, use of named register variables
Tiling for multiple levels of cache and TLB
Exploiting fine-grained parallelism in processor
superscalar pipelining
Complicated compiler interactions
Hard to do by hand (but you should try!)
Automatic optimization an active research area
BeBOP bebop.cs.berkeley.edu , FFTW, Spiral,
Older work
PHiPAC www.icsi.berkeley.edu/bilmes/phipac
in particular tr-98-035.ps.gz
ATLAS www.netlib.org/atlas
"Performance Optimization of Numerically
Intensive Codes", by Stefan Goedecker and Adolfy
Hoisie, SIAM 2001.

52
Removing False Dependencies

Using local variables, reorder operations to
remove false dependencies

ai bi c ai1 bi1 d
false read-after-write hazard between ai and
bi1
float f1 bi float f2 bi1 ai f1
c ai1 f2 d

With some compilers, you can declare a and b
unaliased.
Done via restrict pointers, compiler flag, or
pragma

53
Exploit Multiple Registers

Reduce demands on memory bandwidth by pre-loading
into local variables

while( ) res filter0signal0
filter1signal1
filter2signal2 signal
float f0 filter0 float f1 filter1 float
f2 filter2 while( ) res
f0signal0 f1signal1
f2signal2 signal
also register float f0
Example is a convolution
54
Minimize Pointer Updates

Replace pointer updates for strided memory
addressing with constant array offsets

f0 r8 r8 4 f1 r8 r8 4 f2 r8
r8 4
f0 r80 f1 r84 f2 r88 r8 12

Pointer vs. array expression costs may differ.
Some compilers do a better job at analyzing one
than the other

55
Loop Unrolling

Expose instruction-level parallelism

float f0 filter0, f1 filter1, f2
filter2 float s0 signal0, s1 signal1,
s2 signal2 res f0s0 f1s1
f2s2 do signal 3 s0 signal0
res0 f0s1 f1s2 f2s0 s1
signal1 res1 f0s2 f1s0 f2s1
s2 signal2 res2 f0s0 f1s1
f2s2 res 3 while( )
56
Expose Independent Operations

Hide instruction latency
Use local variables to expose independent
operations that can execute in parallel or in a
pipelined fashion
Balance the instruction mix (what functional
units are available?)

f1 f5 f9 f2 f6 f10 f3 f7 f11 f4
f8 f12
57
Copy optimization

Copy input operands or blocks
Reduce cache conflicts
Constant array offsets for fixed size blocks
Expose page-level locality
Alternative use different data structures from
start (if users willing)

Original matrix (numbers are addresses)
Reorganized into 2x2 blocks
0
4
8
12
0
2
8
10
1
5
9
13
1
3
9
11
2
6
10
14
4
6
12
13
3
7
11
15
5
7
14
15
58
Summary of Performance Tuning

Performance programming on uniprocessors requires
understanding of memory system moving data
slower than arithmetic
understanding of fine-grained parallelism in
processor
Simple performance models can aid in
understanding
Two ratios are key to efficiency (relative to
peak)
computational intensity of the algorithm
q f/m floating point operations / slow
memory references
machine balance in the memory system
tm/tf time for slow memory reference / time for
floating point operation
Want q gt tm/tf to get half machine peak
Blocking (tiling) is a basic approach to increase
q
Techniques apply generally, but the details
(e.g., block size) are architecture dependent
Similar techniques are possible on other data
structures and algorithms

59
Extra slides
60
More Reading for Today

Sourcebook Chapter 3, (note that chapters 2 and 3
cover the material of lecture 2 and lecture 3,
but not in the same order).
"Performance Optimization of Numerically
Intensive Codes", by Stefan Goedecker and Adolfy
Hoisie, SIAM 2001.
Web pages for reference
BeBOP Homepage
ATLAS Homepage
BLAS (Basic Linear Algebra Subroutines),
Reference for (unoptimized) implementations of
the BLAS, with documentation.
LAPACK (Linear Algebra PACKage), a standard
linear algebra library optimized to use the BLAS
effectively on uniprocessors and shared memory
machines (software, documentation and reports)
ScaLAPACK (Scalable LAPACK), a parallel version
of LAPACK for distributed memory machines
(software, documentation and reports)
Tuning Strassen's Matrix Multiplication for
Memory Efficiency Mithuna S. Thottethodi,
Siddhartha Chatterjee, and Alvin R. Lebeck in
Proceedings of Supercomputing '98, November 1998
postscript
Recursive Array Layouts and Fast Parallel Matrix
Multiplication by Chatterjee et al. IEEE TPDS
November 2002.

61
Questions You Should Be Able to Answer

What is the key to understand algorithm
efficiency in our simple memory model?
What is the key to understand machine efficiency
in our simple memory model?
What is tiling?
Why does block matrix multiply reduce the number
of memory references?
What are the BLAS?
Why does loop unrolling improve uniprocessor
performance?

62
Proof of Communication Lower Bound on C AB
(1/6)

Proof from Irony/Tishkin/Toledo (2004)
Well need it for the rest of linear algebra
Think of instruction stream being executed
Looks like add, load, multiply, store,
load, add,
We want to count the number of loads and stores,
given that we are multiplying n-by-n matrices C
AB using the usual 2n3 flops, possibly
reordered assuming addition is commutative/associa
tive
It actually isnt associative in floating point,
but close enough
Assuming that at most M words can be stored in
fast memory
Outline
Break instruction stream into segments, each
containing M loads and stores
Somehow bound the maximum number of adds and
multiplies that could be done in each segment,
call it F
So F segments ? 2n3 , and
segments ? 2n3 / F
So loads stores M segments ? M 2
n3 / F

63
Proof of Communication Lower Bound on C AB
(2/6)

Given segment of instruction stream with M loads
stores, how many adds multiplies (F) can we
do?
At most 2M entries of C, 2M entries of A and/or
2M entries of B can be accessed
Use geometry
Represent 2n3 operations by n x n x n cube
One n x n face represents A
each 1 x 1 subsquare represents one A(i,k)
One n x n face represents B
each 1 x 1 subsquare represents one B(k,j)
One n x n face represents C
each 1 x 1 subsquare represents one C(i,j)
Each 1 x 1 x 1 subcube represents one C(i,j)
A(i,k) B(k,j)

64
Proof of Communication Lower Bound on C AB
(3/6)
k
C face
C(2,3)
C(1,1)
B(3,1)
A(1,3)
j
B(1,3)
A(2,1)
B face
i
A face
65
Proof of Communication Lower Bound on C AB
(4/6)

Given segment of instruction stream with M load
stores, how many adds multiplies (F) can we do?
At most 2M entries of C, and/or of A and/or of
B can be accessed
Use geometry
Represent 2n3 operations by n x n x n cube
One n x n face represents A
each 1 x 1 subsquare represents one A(i,k)
One n x n face represents B
each 1 x 1 subsquare represents one B(k,j)
One n x n face represents C
each 1 x 1 subsquare represents one C(i,j)
Each 1 x 1 x 1 subcube represents one C(i,j)
A(i,k) B(k,j)

If we have at most 2M A squares, 2M B
squares, and 2M C squares on faces, how many
cubes can we have?

66
Proof of Communication Lower Bound on C AB
(5/6)
cubes in black box with side lengths x, y
and z Volume of black box xyz (A?s
B?s C?s )1/2 ( xz zy yx)1/2
67
Proof of Communication Lower Bound on C AB
(6/6)

Consider one segment of instructions with M
loads and stores
There can be at most 2M entries of A, B, C
available in one segment
Volume of set of cubes representing possible
multiply/adds (2M 2M 2M)1/2 (2M) 3/2 F
Segments ? 2n3 / F
Loads Stores M Segments ? M 2n3 / F
n3 / (2M)1/2

Parallel Case apply reasoning to one processor
out of P
Adds and Muls 2n3 / P (assuming load
balanced)
M n2 / P (each processor gets equal fraction of
matrix)
Load Stores words communicated with
other procs ? M (2n3 /P) / F M (2n3 /P) /
(2M) 3/2 n2 / (2P)1/2

68
Communication Lower Bound for Parallel MatMul

Apply reasoning to one processor out of P
Adds and Muls 2n3 / P
At least one processor must do this much work
M n2 / P
Assume each processor gets equal fraction of
matrix
(Other assumptions possible)
Load Stores words communicated with
other procs ? M (2n3 /P) / F M (2n3 /P) /
(2M) 3/2 n2 / (2P)1/2

Possible to communicate less than this
Need to increase M by having multiple copies of
matrices
3D algorithm
P1/3 copies of each matrix, so M n2 / P2/3
Lower bound becomes ?(n2 / (2P)2/3) ,
O(P1/6) times smaller than before attainable

69
Lower bounds on latency O(messages)

Hong/Kung and Irony/Tishkin/Toledo theorem is a
lower bound on amount of data communicated
Bandwidth cost
Can get a lower bound on the latency cost (number
of messages sent) by dividing about lower bound
by maximum message size fast memory size M
messages for sequential MatMul ?( n3 / M3/2
)
Ex disk accesses
messages for parallel Matmul ?((n2 / P1/2 ) /
(n2 / P)) ?( P1/2)

70
Communication lower bounds for all Linear Algebra

Irony/Tiskin/Toledo proof can be extended
Proof does not depend on doing matrix
multiplication
Out(i,j) func( In1(i,k), In2(k,j) for k ? S,
where S is some set )
Only need Out(i,j), etc to be a one-to-one map
from (i,j) to memory
Proof does not depend on inputs being dense
S can be any set, depending on (i,j)
Proof does not assume inputs and outputs do not
overlap
Eg in Gaussian Elimination A LU, L and U
overwrite A
Applies to LU, Cholesky, solving triangular
systems,
Dense or sparse matrices
Sequential or parallel algorithm
Lower bound on words moved ? ( flops / M1/2 )
Lower bound on messages ? ( flops / M3/2 )
Also extends to QR, eigenvalue algorithms, SVD
New proof need to account for data that is
created in fast memory, used, discarded without
being written to memory
Same lower bounds

Review of material on performance of
matrix multiplication so far
More about tuning matrix multiplication, other
codes

72
Using a Simple Model of Memory to Optimize

Assume just 2 levels in the hierarchy, fast and
slow
All data initially in slow memory
m number of memory elements (words) moved
between fast and slow memory
tm time per slow memory operation
f number of arithmetic operations
tf time per arithmetic operation ltlt tm
q f / m average number of flops per slow
memory access
Minimum possible time f tf when all data in
fast memory
Actual time
f tf m tm f tf (1 tm/tf 1/q)
Larger q means time closer to minimum f tf
q ? tm/tf needed to get at least half of peak
speed
Goal reorganize algorithm (change order of
operations) to maximize q ( minimize tm )

73
Naïve Matrix Multiply no better than
Matrix-vector multiply

implements C C AB
for i 1 to n
read row i of A into fast memory n reads,
n2 total
for j 1 to n
read C(i,j) into fast memory 1 read,
n2 total
read column j of B into fast memory n
reads, n3 total
for k 1 to n
C(i,j) C(i,j) A(i,k) B(k,j)
all data in fast memory
write C(i,j) back to slow memory 1
write, n2 total

m slow memory accesses n3 3n2 , so q
f/m 2n3 /( n3 3n2) ? 2
74
From Naïve to Blocked Matrix Multiply

C(i,j) C(i,j) A(i,1)B(1,j) A(i,2)B(2,j)
C(i,j) A(i,1n)B(1n,j)
True if C(i,j) is a scalar (1x1 submatrix)
A(i,1n) is a row of A and B(1n,j) is a column
of B
Operation is a dot product
True if C(i,j) is a b x b submatrix
A(i,1n) is a b x n submatrix of A and B(1n,j)
is an n x b submatrix of B
C(i,j) C(i,j) A(i,1)B(1,j) A(i,2)B(2,j)
A(i,N)B(N,j) where
All factors are b x b subblocks
N n/b number of b x b blocks

75
Blocked (Tiled) Matrix Multiply

Consider A,B,C to be N-by-N matrices of b-by-b
subblocks where
bn / N is called the block size
for i 1 to N
for j 1 to N
read block C(i,j) into fast memory
b2 reads, N2 b2 n2 total
for k 1 to N
read block A(i,k) into fast
memory b2 reads, N3 b2 n3/b total
read block B(k,j) into fast
memory b2 reads, N3 b2 n3/b total
C(i,j) C(i,j) A(i,k)
B(k,j) do a matrix multiply on blocks
all data in fast memory
write block C(i,j) back to slow memory
b2 reads, N2 b2 n2 total

m slow memory accesses 2n3/b 2n2 , so q
f/m ? b
76
Blocked (Tiled) Matrix Multiply (Review)

m slow memory accesses 2n3/b 2n2 ,
so q f/m 2n3 / ( 2n3/b 2n2 ) ? b for
large n
To minimize m ( maximize q ) we want to make b as
large as possible
3b2 ? Mfast, so that all three b x b
submatrices C(i,j), A(i,k), B(k,j) fit
So q ? b ? (Mfast/3)1/2
Theorem (Hong/Kung, 1981 Irony/Tishkin/Toledo
2004) Any reorganization of
this algorithm (that uses only associativity and
commutativity of addition) is limited to q O(
(Mfast)1/2 )
Blocked matrix multiplication minimizes the
number of slow memory accesses (up to a constant
factor)

77
(No Transcript)
78
Current Records for Solving Dense Systems
(11/2008)
www.netlib.org, click on Performance Database
Server

Gigaflops Machine n100
n1000 Any n Peak Roadrunner (LANL)
1.1M
1.5M (1.1 Petaflops, 130K cores, 2.5 MW,
n2.3M)

NEC SX 8
(fastest in 2005) (8 proc, 2 GHz)
75.1
128 (1 proc, 2 GHz) 2.2
15.0 16
Palm Pilot III .00000169
(1.69 Kiloflops)
79
For all linear algebra problemsEx LAPACK Table
of Contents

Linear Systems
Least Squares
Overdetermined, underdetermined
Unconstrained, constrained, weighted
Eigenvalues and vectors of Symmetric Matrices
Standard (Ax ?x), Generalized (Ax?xB)
Eigenvalues and vectors of Unsymmetric matrices
Eigenvalues, Schur form, eigenvectors, invariant
subspaces
Standard, Generalized
Singular Values and vectors (SVD)
Standard, Generalized
Level of detail
Simple Driver
Expert Drivers with error bounds,
extra-precision, other options
Lower level routines (apply certain kind of
orthogonal transformation)

80
For all matrix/problem structuresEx LAPACK
Table of Contents

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

BD bidiagonal
GB general banded
GE general
GG general , pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

81
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general , pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

82
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

83
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

84
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

85
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general , pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

86
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

87
Motif/Dwarf Common Computational Methods (Red
Hot ? Blue Cool)
What do commercial and CSE applications have in
common?
88
Basic Linear Algebra Subroutines (BLAS)

Industry standard interface (evolving)
www.netlib.org/blas, www.netlib.org/blas/blast-
-forum
Vendors, others supply optimized implementations
History
BLAS1 (1970s)
vector operations dot product, saxpy (yaxy),
etc
m2n, f2n, q 1 or less
BLAS2 (mid 1980s)
matrix-vector operations matrix vector multiply,
etc
mn2, f2n2, q2, less overhead
somewhat faster than BLAS1
BLAS3 (late 1980s)
matrix-matrix operations matrix matrix multiply,
etc
m ? 3n2, fO(n3), so qf/m can possibly be as
large as n (or Mfast1/2 ), so BLAS3
is potentially much faster than BLAS2
Good algorithms used BLAS3 when possible (LAPACK
ScaLAPACK)
More later in course

89
Locality in Other Algorithms

The performance of any algorithm is limited by q
computational intensity
In matrix multiply, we increase q by changing
computation order
increased temporal locality
For other algorithms and data structures, even
hand-transformations are still an open problem
Lots of open problems

90
Simplifying Assumptions

What simplifying assumptions did we make in this
analysis?
Ignored parallelism in processor between memory
and arithmetic within the processor
Sometimes drop arithmetic term in this type of
analysis
Assumed fast memory was large enough to hold
three vectors
Reasonable if we are talking about any level of
cache
Not if we are talking about registers (32 words)
Assumed the cost of a fast memory access is 0
Reasonable if we are talking about registers
Not necessarily if we are talking about cache
(1-2 cycles for L1)
Memory latency is constant
Could simplify even further by ignoring memory
operations in X and Y vectors
Mflop rate/element 2 / (2 tf tm)

91
Experiments on Search vs. Modeling

Study compares search (Atlas) to optimization
selection based on performance models
Ten architectures
Model did well on most cases
Better on UltraSparc
Worse on Itanium
Eliminating performance gaps think globally,
search locally
-small performance gaps local search
-large performance gaps
refine model
Substantial gap between ATLAS CGw/S and ATLAS
Unleashed on some machines

Source K. Pingali. Results from IEEE 05 paper
by K Yotov, X Li, G Ren, M Garzarán, D Padua, K
Pingali, P Stodghill.

Write a Comment

User Comments (0)