Title: Minimizing Communication in Linear Algebra
Outline
- Background: 2 Big Events
- Why avoiding communication is important
- Communication-optimal direct linear algebra
- Autotuning of sparse matrix codes
- Communication-optimal iterative linear algebra
Collaborators
- Grey Ballard, UCB EECS
- Ioana Dumitriu, U. Washington
- Laura Grigori, INRIA
- Ming Gu, UCB Math
- Mark Hoemmen, UCB EECS
- Olga Holtz, UCB Math / TU Berlin
- Julien Langou, U. Colorado Denver
- Marghoob Mohiyuddin, UCB EECS
- Oded Schwartz, TU Berlin
- Hua Xiang, INRIA
- Sam Williams, LBL
- Vasily Volkov, UCB EECS
- Kathy Yelick, UCB EECS / NERSC
- BeBOP group, bebop.cs.berkeley.edu
- ParLab, parlab.eecs.berkeley.edu
Thanks to Intel, Microsoft, UC Discovery, NSF, DOE, ...
2 Big Events
- Multicore revolution, requiring all software (where performance matters!) to change
  - ParLab: Industry/State funded center with 15 faculty, 50 grad students (parlab.eecs.berkeley.edu)
  - Rare synergy between commercial and scientific computing
- Establishment of a new graduate program in Computational Science and Engineering (CSE)
  - 117 Faculty from 22 Departments
  - 60 existing courses, more being developed
  - CS267: Applications of Parallel Computers (www.cs.berkeley.edu/demmel/cs267_Spr09)
  - 3 Day Parallel Boot Camp in August (parlab.eecs.berkeley.edu/2009bootcamp)
Parallelism Revolution is Happening Now
- Chip density is continuing to increase ~2x every 2 years
  - Clock speed is not
  - Number of processor cores may double instead
- There is little or no more hidden parallelism (ILP) to be found
- Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
The 7 Dwarfs of High Performance Computing
- Phil Colella (LBL) identified 7 kernels out of which most large-scale simulation and data-analysis programs are composed:
- Dense Linear Algebra
  - Ex: Solve Ax=b or Ax=λx where A is a dense matrix
- Sparse Linear Algebra
  - Ex: Solve Ax=b or Ax=λx where A is a sparse matrix (mostly zero)
- Operations on Structured Grids
  - Ex: Anew(i,j) = 4*A(i,j) - A(i-1,j) - A(i+1,j) - A(i,j-1) - A(i,j+1)
- Operations on Unstructured Grids
  - Ex: Similar, but the list of neighbors varies from entry to entry
- Spectral Methods
  - Ex: Fast Fourier Transform (FFT)
- N-Body (aka Particle Methods)
  - Ex: Compute electrostatic forces using the Fast Multipole Method
- Monte Carlo (aka MapReduce)
  - Ex: Many independent simulations using different inputs
Motif/Dwarf: Common Computational Methods (Red Hot → Blue Cool)
www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
2 Big Events
- Multicore revolution, requiring all software (where performance matters!) to change
  - ParLab: Industry/State funded center with 15 faculty, 50 grad students (parlab.eecs.berkeley.edu)
  - Rare synergy between commercial and scientific computing
- Establishment of a new graduate program in Computational Science and Engineering (CSE) at UCB
  - 117 Faculty from 22 Departments
  - 60 existing courses, more being developed
  - CS267: Applications of Parallel Computers (www.cs.berkeley.edu/demmel/cs267_Spr09)
  - 3 Day Parallel Boot Camp in August (parlab.eecs.berkeley.edu/2009bootcamp)
Why avoiding communication is important (1/2)
- Algorithms have two costs:
  - Arithmetic (flops)
  - Communication: moving data between
    - levels of a memory hierarchy (sequential case)
    - processors over a network (parallel case)
Why avoiding communication is important (2/2)
- Running time of an algorithm is the sum of 3 terms:
  - #flops * time_per_flop
  - #words_moved / bandwidth
  - #messages * latency
- time_per_flop << 1/bandwidth << latency
- Gaps are growing exponentially with time
- Annual improvements: time_per_flop 59%; bandwidth: Network 26%, DRAM 23%; latency: Network 15%, DRAM 5%
Naïve Sequential Matrix Multiply: C = C + A*B

    for i = 1 to n
        {read row i of A into fast memory}            ... n^2 reads
        for j = 1 to n
            {read C(i,j) into fast memory}            ... n^2 reads
            {read column j of B into fast memory}     ... n^3 reads
            for k = 1 to n
                C(i,j) = C(i,j) + A(i,k) * B(k,j)
            {write C(i,j) back to slow memory}        ... n^2 writes

n^3 + O(n^2) reads/writes altogether
Less Communication with Blocked Matrix Multiply
- Blocked Matmul C = A*B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size (see the sketch after this list)
  - Break A (n x n), B (n x n), C (n x n) into b x b blocks labeled A(i,j), etc.
  - b chosen so 3 b x b blocks fit in cache
  - for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
        C(i,j) = C(i,j) + A(i,k)*B(k,j)    ... b x b matmul, 4b^2 reads/writes
  - (n/b)^3 * 4b^2 = 4n^3/b reads/writes altogether
  - Minimized when 3b^2 = cache size = M, yielding O(n^3 / M^(1/2)) reads/writes
- What if we had more levels of memory (L1, L2, cache, etc.)?
  - Would need 3 more nested loops per level
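A minimal NumPy sketch of the blocked loop above; the block size b is just a stand-in for whatever makes three b x b blocks fit in cache, and the code only illustrates the loop structure, not a tuned implementation.

```python
import numpy as np

def blocked_matmul(A, B, C, b):
    """C = C + A*B with one level of cache blocking; assumes n is a multiple of b."""
    n = A.shape[0]
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # b x b matmul: touches one block each of A, B and C
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

n, b = 256, 32   # illustrative sizes, not tuned to any particular cache
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = blocked_matmul(A, B, np.zeros((n, n)), b)
assert np.allclose(C, A @ B)
```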
Blocked vs Cache-Oblivious Algorithms
- Blocked Matmul C = A*B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size:

      Break A (n x n), B (n x n), C (n x n) into b x b blocks labeled A(i,j), etc.
      b chosen so 3 b x b blocks fit in cache
      for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
          C(i,j) = C(i,j) + A(i,k)*B(k,j)    ... b x b matmul
      (another level of memory would need 3 more loops)

- Cache-oblivious Matmul C = A*B is independent of cache (a recursive sketch follows this list):

      Function C = MM(A,B)
      If A and B are 1 x 1
          C = A * B
      else
          Break A (n x n), B (n x n), C (n x n) into (n/2) x (n/2) blocks labeled A(i,j), etc.
          for i = 1 to 2, for j = 1 to 2, for k = 1 to 2
              C(i,j) = C(i,j) + MM( A(i,k), B(k,j) )    ... n/2 x n/2 matmul
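A minimal sketch of the cache-oblivious recursion above. The base-case size of 16 is only there to keep Python's recursion overhead reasonable; the algorithm itself never consults a cache size.

```python
import numpy as np

def mm(A, B, C):
    """C = C + A*B by recursively splitting into (n/2) x (n/2) blocks (n a power of 2)."""
    n = A.shape[0]
    if n <= 16:                      # base case (1 x 1 in the slide's pseudocode)
        C += A @ B
        return C
    h = n // 2
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                mm(A[i*h:(i+1)*h, k*h:(k+1)*h],
                   B[k*h:(k+1)*h, j*h:(j+1)*h],
                   C[i*h:(i+1)*h, j*h:(j+1)*h])   # updates the C block in place
    return C

n = 128
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(mm(A, B, np.zeros((n, n))), A @ B)
```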
Direct linear algebra: Prior Work on Matmul
- Assume n^3 algorithm (i.e. not Strassen-like)
- Sequential case, with fast memory of size M
  - Lower bound on #words moved to/from slow memory = Ω(n^3 / M^(1/2))  [Hong & Kung, 81]
  - Attained using blocked or cache-oblivious algorithms
- Parallel case on P processors
  - Let NNZ be the total memory needed; assume load balanced
  - Lower bound on #words communicated = Ω(n^3 / (P * NNZ)^(1/2))  [Irony, Tiskin, Toledo, 04]

| NNZ (name of alg) | Lower bound on #words | Attained by |
|---|---|---|
| 3n^2 (2D alg) | Ω(n^2 / P^(1/2)) | [Cannon, 69] |
| 3n^2 P^(1/3) (3D alg) | Ω(n^2 / P^(2/3)) | [Johnson, 93] |
New lower bound for all "direct" linear algebra
- Let M = fast memory size per processor
- #words_moved by at least one processor = Ω(#flops / M^(1/2))
- #messages_sent by at least one processor = Ω(#flops / M^(3/2)) (both bounds are restated in display form below)
- Holds for
  - BLAS, LU, QR, eig, SVD, ...
  - Some whole programs (sequences of these operations, no matter how they are interleaved, e.g. computing A^k)
  - Dense and sparse matrices (where #flops << n^3)
  - Sequential and parallel algorithms
  - Some graph-theoretic algorithms (e.g. Floyd-Warshall)
- See [BDHS09]; proof sketch if time!
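Restating the two bounds above in display form, and instantiating the first one for dense n x n matmul (#flops = 2n^3) to recover the [Hong & Kung, 81] bound from the previous slide:

```latex
\#\text{words\_moved} = \Omega\!\left(\frac{\#\text{flops}}{M^{1/2}}\right),
\qquad
\#\text{messages} = \Omega\!\left(\frac{\#\text{flops}}{M^{3/2}}\right);
\qquad
\text{matmul: } \#\text{flops} = 2n^3 \;\Rightarrow\;
\#\text{words\_moved} = \Omega\!\left(\frac{n^3}{M^{1/2}}\right).
```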
Can we attain these lower bounds?
- Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  - Mostly not
- If not, are there other algorithms that do?
  - Yes
- Goals for algorithms:
  - Minimize #words = Ω(#flops / M^(1/2))
  - Minimize #messages = Ω(#flops / M^(3/2))
    - Need new data structures
  - Minimize for multiple memory hierarchy levels
    - Cache-oblivious algorithms would be simplest
  - Fewest flops when matrix fits in fastest memory
    - Cache-oblivious algorithms don't always attain this
- Only a few sparse algorithms so far
Minimizing messages requires new data structures
- Need to load/store a whole rectangular subblock of the matrix with one message
  - Incompatible with conventional columnwise (rowwise) storage
  - Ex: rows (columns) are not in contiguous memory locations
- Blocked storage: store as a matrix of b x b blocks, each block stored contiguously
  - Ok for one level of memory hierarchy; what if more?
- Recursive blocked storage: store each block using subblocks (see the sketch below)
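A minimal sketch of one level of blocked storage using NumPy reshapes (the helper names are just for illustration): each b x b block ends up contiguous in memory, so a whole block can be read or sent as one unit. Recursive blocked storage would apply the same trick inside each block.

```python
import numpy as np

def to_blocked(A, b):
    """Reorder an n x n matrix (n a multiple of b) into an (n/b, n/b, b, b) array of blocks."""
    n = A.shape[0]
    # blocks[i, j] is the b x b block A(i,j), stored contiguously
    return np.ascontiguousarray(A.reshape(n // b, b, n // b, b).swapaxes(1, 2))

def from_blocked(blocks):
    """Inverse of to_blocked."""
    nb, _, b, _ = blocks.shape
    return blocks.swapaxes(1, 2).reshape(nb * b, nb * b)

A = np.arange(64, dtype=float).reshape(8, 8)
blocks = to_blocked(A, 4)
assert np.allclose(from_blocked(blocks), A)
assert blocks[0, 1].flags["C_CONTIGUOUS"]   # each block is one contiguous chunk
```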
Summary of dense sequential algorithms attaining communication lower bounds
- Algorithms shown minimizing #Messages use (recursive) block layout
- Many references (see reports); only some shown, plus ours
- Cache-oblivious are underlined, Green are ours, ? is unknown/future work

| Algorithm | 2 Levels of Memory: Words Moved | 2 Levels: Messages | Multiple Levels of Memory: Words Moved | Multiple Levels: Messages |
|---|---|---|---|---|
| BLAS-3 | Usual blocked or recursive algorithms | Usual blocked or recursive algorithms | Usual blocked algorithms (nested), or recursive [Gustavson, 97] | Usual blocked algorithms (nested), or recursive [Gustavson, 97] |
| Cholesky | LAPACK (with b = M^(1/2)); [Gustavson, 97]; [BDHS09] | [Gustavson, 97]; [Ahmed, Pingali, 00]; [BDHS09] | (same) | (same) |
| LU with pivoting | LAPACK (rarely); [Toledo, 97]; [GDX 08] | [GDX 08] (not partial pivoting) | [Toledo, 97]; [GDX 08]? | [GDX 08]? |
| QR | LAPACK (rarely); [Elmroth, Gustavson, 98]; [DGHL08]; [Frens, Wise, 03] (but 3x flops) | [DGHL08] | [Elmroth, Gustavson, 98]; [DGHL08]? | [Frens, Wise, 03]; [DGHL08]? |
| Eig, SVD | Not LAPACK; [BDD09] (randomized, but more flops) | Not LAPACK; [BDD09] (randomized, but more flops) | [BDD09] | [BDD09] |
Summary of dense 2D parallel algorithms attaining communication lower bounds
- Assume n x n matrices on P processors, memory per processor = O(n^2 / P)
- ScaLAPACK assumes the best block size b is chosen
- Many references (see reports); Green are ours
- Recall lower bounds: #words_moved = Ω(n^2 / P^(1/2)) and #messages = Ω(P^(1/2))

| Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages |
|---|---|---|---|
| Matrix multiply | [Cannon, 69] | 1 | 1 |
| Cholesky | ScaLAPACK | log P | log P |
| LU | [GDX08] / ScaLAPACK | log P / log P | log P / (N/P^(1/2)) log P |
| QR | [DGHL08] / ScaLAPACK | log P / log P | log^3 P / (N/P^(1/2)) log P |
| Sym Eig, SVD | [BDD09] / ScaLAPACK | log P / log P | log^3 P / (N/P^(1/2)) |
| Nonsym Eig | [BDD09] / ScaLAPACK | log P / (P^(1/2) log P) | log^3 P / (N log P) |
QR of a tall, skinny matrix is the bottleneck; use TSQR instead
Q is represented implicitly as a product (tree of factors) [DGHL08]

Minimizing Communication in TSQR
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
Can choose the reduction tree dynamically
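A minimal sketch of TSQR's reduction tree, using NumPy's QR for the local factorizations and returning only the final R factor; the implementations in the talk keep Q implicitly as the tree of local Q factors and use tuned local kernels (e.g. Elmroth-Gustavson), which this sketch does not attempt. In a parallel setting each leaf block would live on a different processor and each pass through the while-loop would be one round of messages.

```python
import numpy as np

def tsqr_R(A, nblocks=4):
    """R factor of a tall-skinny A via a level-by-level binary reduction tree."""
    # leaf level: local QR of each row block, keep only the small R factors
    Rs = [np.linalg.qr(blk, mode="r") for blk in np.array_split(A, nblocks)]
    while len(Rs) > 1:
        # pairwise combine: QR of two stacked R factors gives the R of their union
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode="r")
              for i in range(0, len(Rs), 2)]
    return Rs[0]

A = np.random.rand(10000, 50)
R = tsqr_R(A)
R_ref = np.linalg.qr(A, mode="r")
# the two R factors agree up to the sign of each row
assert np.allclose(np.abs(R), np.abs(R_ref), atol=1e-6)
```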
Performance of TSQR vs Sca/LAPACK
- Parallel
  - Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  - Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  - BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  - Both use Elmroth-Gustavson locally, enabled by TSQR
- Sequential
  - Out-of-core (OOC) on a PowerPC laptop
    - As little as 2x slowdown vs (predicted) infinite DRAM
    - LAPACK with virtual memory never finished
- See UC Berkeley EECS Tech Report 2008-89
- General CAQR = Communication-Avoiding QR in progress
Modeled Speedups of CAQR vs ScaLAPACK
- Petascale: up to 22.9x
- IBM Power 5: up to 9.7x
- Grid: up to 11x
Petascale machine modeled with 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.
Randomized Rank-Revealing QR
- Needed for eigenvalue and SVD algorithms
- func [Q,R,V] = RRQR(A) (a NumPy transcription follows this list)
  - [V,R1] = qr(randn(n,n))   ... V random with Haar measure
  - [Q,R] = qr(A * V^T)       ... so A = Q * R * V
- Theorem (DDH). Provided that σ_(r+1) << σ_r <= σ_1, RRQR produces a rank-revealing decomposition with high probability (when r = O(n), the probability is 1 - O(n^(-c))).
- Can apply to A = A_1^(+/-1) * A_2^(+/-1) * ... * A_k^(+/-1)
  - Do RRQR, followed by a series of (k-1) QR / RQ factorizations to propagate the unitary/orthogonal matrix to the front
  - Analogous to [Stewart, 94]
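A direct NumPy transcription of the RRQR pseudocode above (randn/qr as in the slide's Matlab-style notation); it is only a sketch of the construction, not the [DDH] code, and the rank-10 test matrix is hypothetical.

```python
import numpy as np

def rrqr(A):
    n = A.shape[0]
    V, _ = np.linalg.qr(np.random.randn(n, n))   # V random with Haar measure
    Q, R = np.linalg.qr(A @ V.T)                 # so A = Q @ R @ V
    return Q, R, V

np.random.seed(1)
n, r = 100, 10
A = np.random.randn(n, r) @ np.random.randn(r, n)   # numerically rank r
Q, R, V = rrqr(A)
assert np.allclose(Q @ R @ V, A, atol=1e-10)
print(np.abs(np.diag(R))[r-2:r+2])   # |R(k,k)| drops sharply after k = r
```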
What about o(n^3) algorithms, like Strassen?
- Thm (DDHK07)
  - Given any matmul running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^(ω+ε)) ops, for any ε > 0
  - Current record: ω ≈ 2.38
- Thm (DDH08)
  - Given any stable matmul running in O(n^(ω+ε)) ops, it is possible to do backward stable dense linear algebra in O(n^(ω+ε)) ops
- Also reduces communication to O(n^(ω+ε))
  - But we can do much better (work in progress)
Direct Linear Algebra: summary and future work
- Communication lower bounds on #words_moved and #messages
  - BLAS, LU, Cholesky, QR, eig, SVD, ...
  - Some whole programs (compositions of these operations, no matter how they are interleaved, e.g. computing A^k)
  - Dense and sparse matrices (where #flops << n^3)
  - Sequential and parallel algorithms
  - Some graph-theoretic algorithms (e.g. Floyd-Warshall)
- Algorithms that attain these lower bounds
  - Nearly all dense problems (some open problems)
  - A few sparse problems
- Speedups in theory and practice
- Future work
  - Lots to implement, tune
  - Next generation of Sca/LAPACK on heterogeneous architectures (MAGMA)
  - Few algorithms in the sparse case
  - Strassen-like algorithms
  - 3D algorithms (only for Matmul so far), may be important for scaling
How hard is hand-tuning matmul, anyway?
- Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
- Students were given blocked code to start with
- Still hard to get close to vendor-tuned performance (ACML)
- For more discussion, see www.cs.berkeley.edu/volkov/cs267.sp09/hw1/results/
How hard is hand-tuning matmul, anyway?
What Part of the Matmul Search Space Looks Like
A 2-D slice of a 3-D register-tile search space (axes: number of rows and number of columns in the register block). The dark blue region was pruned. (Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler)
Autotuning Matmul with ATLAS (n = 500)
Source: Jack Dongarra
- ATLAS is faster than all other portable BLAS implementations and is comparable with machine-specific libraries provided by the vendor.
- ATLAS written by C. Whaley, inspired by PHiPAC, by Asanovic, Bilmes, Chin, D.
Automatic Performance Tuning
- Goal: let the machine do the hard work of writing fast code
- What does tuning of Matmul, dense BLAS, FFTs, signal processing, ... have in common?
  - Can do the tuning off-line, once per architecture and algorithm
  - Can take as much time as necessary (hours, a week, ...)
  - At run-time, the algorithm choice may depend only on a few parameters (matrix dimensions, size of FFT, etc.)
- Can't always do tuning off-line
  - Algorithm and implementation may strongly depend on data only known at run-time
  - Ex: sparse matrix nonzero pattern determines both the best data structure and the implementation of sparse matrix-vector multiplication (SpMV)
  - Part of the search for the best algorithm must be done (very quickly!) at run-time
Source: Accelerator Cavity Design Problem (from Ko via Husbands)

Linear Programming Matrix

A Sparse Matrix You Use Every Day
Matrix-vector multiply kernel y(i) ? y(i)
A(i,j)x(j) for each row i for kptri to
ptri1 do yi yi valkxindk
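The same CSR kernel written out as a small, self-contained Python function (the array names val, ind, ptr follow the slide; a tuned implementation would of course not loop in interpreted Python).

```python
import numpy as np

def spmv_csr(ptr, ind, val, x, y):
    """y(i) <- y(i) + sum_j A(i,j) * x(j), with A stored in CSR format."""
    for i in range(len(ptr) - 1):                # for each row i
        for k in range(ptr[i], ptr[i + 1]):      # for k = ptr[i] to ptr[i+1]-1
            y[i] += val[k] * x[ind[k]]
    return y

# Tiny example: A = [[2, 0, 1], [0, 3, 0]]
ptr = np.array([0, 2, 3])
ind = np.array([0, 2, 1])
val = np.array([2.0, 1.0, 3.0])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(ptr, ind, val, x, np.zeros(2)))   # -> [3. 3.]
```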
Example: The Difficulty of Tuning
- n = 21200
- nnz = 1.5 M
- kernel: SpMV
- Source: NASA structural analysis problem
- 8x8 dense substructure: exploit this to limit #mem_refs
Speedups on Itanium 2: The Need for Search
(Figure: SpMV performance in Mflop/s.)

Register Profile: Itanium 2
(Figure: performance ranges from 190 Mflop/s to 1190 Mflop/s across register-block sizes.)
Register Profiles: IBM and Intel IA-64
(Figure, four panels: Power3 - 17, Power4 - 16, Itanium 2 - 33, Itanium 1 - 8. Per-platform SpMV rates range roughly from 107 Mflop/s up to 1.2 Gflop/s; on Itanium 2 the range is 190 Mflop/s to 1.2 Gflop/s.)
Register Profiles: Sun and Intel x86
(Figure, four panels: Ultra 2i - 11, Ultra 3 - 5, Pentium III-M - 15, Pentium III - 21. Per-platform SpMV rates range roughly from 35 Mflop/s up to 122 Mflop/s.)
Another example of tuning challenges
- More complicated non-zero structure in general
- N = 16614
- NNZ = 1.1 M

Zoom in to top corner
- More complicated non-zero structure in general
- N = 16614
- NNZ = 1.1 M

3x3 blocks look natural, but...
- More complicated non-zero structure in general
- Example: 3x3 blocking
  - Logical grid of 3x3 cells
- But this would lead to lots of fill-in
Extra Work Can Improve Efficiency!
- More complicated non-zero structure in general
- Example: 3x3 blocking
  - Logical grid of 3x3 cells
  - Fill in explicit zeros
  - Unroll 3x3 block multiplies
  - Fill ratio = 1.5
- On Pentium III: 1.5x speedup!
  - Actual Mflop rate is 1.5^2 = 2.25x higher
- In general: sample the matrix to choose the r x c block structure using a heuristic (a sketch follows)
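A minimal sketch of that sampling heuristic: estimate the fill ratio of r x c register blocking from a random sample of nonzeros. The estimator (averaging r*c over the nonzero count of each sampled block) and the random test matrix are illustrative; OSKI's actual heuristic also folds in a benchmarked Mflop-rate profile per block size.

```python
import numpy as np
import scipy.sparse as sp

def estimate_fill(A_csr, r, c, nsamples=200, seed=0):
    """Estimated fill ratio (stored entries / true nonzeros) for r x c blocking."""
    rng = np.random.default_rng(seed)
    rows, cols = A_csr.nonzero()
    idx = rng.choice(len(rows), size=min(nsamples, len(rows)), replace=False)
    ratios = []
    for i, j in zip(rows[idx], cols[idx]):
        bi, bj = (i // r) * r, (j // c) * c
        nnz_in_block = A_csr[bi:bi+r, bj:bj+c].nnz   # nonzeros already in that block
        ratios.append(r * c / nnz_in_block)          # zeros that blocking would add
    return float(np.mean(ratios))

A = sp.random(2000, 2000, density=5e-3, format="csr", random_state=0)
for r, c in [(1, 1), (2, 2), (3, 3), (4, 4)]:
    print((r, c), round(estimate_fill(A, r, c), 2))
```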
Source: Accelerator Cavity Design Problem (from Ko via Husbands)

Post-RCM Reordering

100x100 Submatrix Along Diagonal

Microscopic Effect of RCM Reordering
Before: Green + Red; After: Green + Blue

Microscopic Effect of Combined RCM+TSP Reordering
Before: Green + Red; After: Green + Blue
Speedups from 1.6x to 3.3x
Summary of SpMV Performance Optimizations
- Sequential SpMV
  - Register blocking (RB): up to 4x over CSR
  - Symmetry: 2.8x over CSR, 2.6x over RB
  - Reordering to create dense structure + splitting: 2x over CSR
  - Variable block splitting: 2.1x over CSR, 1.8x over RB
  - Diagonals: 2x over CSR
  - Cache blocking: 2.8x over CSR
  - Multiple vectors (SpMM): 7x over CSR
  - And combinations...
- Multicore SpMV
  - NUMA, affinity, SW prefetch, cache/LS/TLB blocking: 4x over naïve Pthreads
- Sequential sparse triangular solve
  - Hybrid sparse/dense data structure: 1.8x over CSR
- Higher-level kernels
  - A*A^T*x, A^T*A*x: 4x over CSR, 1.8x over RB
  - More later...
- All optimizations (being) automated in OSKI (bebop.cs.berkeley.edu)
  - Being adopted by Cray; Python interface in progress
Avoiding Communication in Iterative Linear Algebra
- k steps of a typical iterative solver for sparse Ax=b or Ax=λx
  - Does k SpMVs with the starting vector
  - Finds "best" solution among all linear combinations of these k+1 vectors
  - Many such Krylov Subspace Methods: Conjugate Gradients, GMRES, Lanczos, Arnoldi, ...
- Goal: minimize communication in Krylov Subspace Methods
  - Assume the matrix is "well-partitioned", with modest surface-to-volume ratio
  - Parallel implementation
    - Conventional: O(k log p) messages, because of k calls to SpMV
    - New: O(log p) messages (optimal)
  - Serial implementation
    - Conventional: O(k) moves of data from slow to fast memory
    - New: O(1) moves of data (optimal)
- Lots of speedup possible (modeled and measured)
  - Price: some redundant computation
- Much prior work: see bebop.cs.berkeley.edu
  - CG: [van Rosendale, 83], [Chronopoulos and Gear, 89]
  - GMRES: [Walker, 88], [Joubert and Carey, 92], [Bai et al., 94]
Minimizing Communication of GMRES to solve Ax=b
- GMRES: find x in span{b, Ab, ..., A^k b} minimizing ||Ax-b||_2
- Cost of k steps of standard GMRES vs new GMRES:

Standard GMRES:
    for i = 1 to k
        w = A * v(i-1)                ... SpMV
        MGS(w, v(0), ..., v(i-1))
        update v(i), H
    endfor
    solve LSQ problem with H
  Sequential: #words_moved = O(k*nnz) from SpMV + O(k^2*n) from MGS
  Parallel: #messages = O(k) from SpMV + O(k^2 log p) from MGS

Communication-avoiding GMRES:
    W = [ v, Av, A^2 v, ..., A^k v ]
    [Q,R] = TSQR(W)                   ... "Tall Skinny QR"
    build H from R, solve LSQ problem
  Sequential: #words_moved = O(nnz) from SpMV + O(k*n) from TSQR
  Parallel: #messages = O(1) from computing W + O(log p) from TSQR

- Oops: W is from the power method, so precision is lost!

Monomial basis [Ax, ..., A^k x] fails to converge
A different polynomial basis does converge
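A small experiment illustrating the problem: build the monomial basis W = [v, Av, ..., A^k v] for a hypothetical tridiagonal test matrix and watch its condition number explode as k grows, which is why CA-GMRES must either use a better-conditioned polynomial basis or orthogonalize W very carefully (via TSQR).

```python
import numpy as np

np.random.seed(0)
n, kmax = 200, 12
# hypothetical test matrix: tridiagonal (2 on the diagonal, -1 off-diagonal)
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
v = np.random.rand(n)

W = np.empty((n, kmax + 1))
W[:, 0] = v / np.linalg.norm(v)
for j in range(1, kmax + 1):
    W[:, j] = A @ W[:, j - 1]        # monomial basis: each column is A times the previous

for k in (2, 4, 8, 12):
    print(f"k = {k:2d}   cond(W[:, :k+1]) = {np.linalg.cond(W[:, :k + 1]):.2e}")
```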
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
Communication-Avoiding Iterative Linear Algebra: Future Work
- Lots of algorithms to implement, autotune
  - Make widely available via OSKI, Trilinos, PETSc, Python, ...
- Extensions to other Krylov subspace methods
  - So far just Lanczos/CG and Arnoldi/GMRES
  - BiCGStab, CGS, QMR, ...
- Add preconditioning
  - Block diagonal preconditioners are easiest
  - Hierarchical, semiseparable, ...
  - Works if interactions between distant points can be compressed
Summary
Proof of Communication Lower Bound on C = A*B (1/5)
- Proof from Irony/Toledo/Tiskin (2004)
  - Original proof, then generalization
- Think of the instruction stream being executed
  - Looks like "... add, load, multiply, store, load, add, ..."
  - Each load/store moves a word between fast and slow memory
- We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A*B using the usual 2n^3 flops, possibly reordered assuming addition is commutative/associative
  - Assuming that at most M words can be stored in fast memory
- Outline:
  - Break the instruction stream into segments, each with M loads and stores
  - Somehow bound the maximum number of flops that can be done in each segment, call it F
  - So F * #segments ≥ T = total flops = 2n^3, so #segments ≥ T / F
  - So #loads + #stores = M * #segments ≥ M * T / F
Illustrating Segments, for M = 3
Proof of Communication Lower Bound on C = A*B (2/5)
- Given a segment of the instruction stream with M loads & stores, how many adds & multiplies (F) can we do?
  - At most 2M entries of C, 2M entries of A and/or 2M entries of B can be accessed
- Use geometry:
  - Represent the n^3 multiplications by an n x n x n cube
  - One n x n face represents A
    - each 1 x 1 subsquare represents one A(i,k)
  - One n x n face represents B
    - each 1 x 1 subsquare represents one B(k,j)
  - One n x n face represents C
    - each 1 x 1 subsquare represents one C(i,j)
  - Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) * B(k,j)
    - May be added directly to C(i,j), or to a temporary accumulator
Proof of Communication Lower Bound on C = A*B (3/5)
(Figure: the n x n x n cube with its A, B and C faces; subsquares such as A(1,1), B(1,1), C(1,1) are labeled along the i, j, k axes.)
- If we have at most 2M "A squares", 2M "B squares", and 2M "C squares" on the faces, how many cubes can we have?
Proof of Communication Lower Bound on C = A*B (4/5)
- (i,k) is in the "A shadow" if (i,j,k) is in the 3D set
- (j,k) is in the "B shadow" if (i,j,k) is in the 3D set
- (i,j) is in the "C shadow" if (i,j,k) is in the 3D set
- Thm (Loomis & Whitney, 1949): #cubes in the 3D set = Volume of the 3D set ≤ (area(A shadow) * area(B shadow) * area(C shadow))^(1/2) (a small numerical check follows this list)
- Example: #cubes in a black box with side lengths x, y and z = Volume of the box = x*y*z = ((x*z)*(z*y)*(y*x))^(1/2) = (product of the three shadow areas)^(1/2)
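A brute-force numerical check of the inequality as stated above, on small random 3D index sets (purely illustrative):

```python
import random

random.seed(0)
for _ in range(1000):
    # a random set of (i, j, k) triples inside a 4 x 4 x 4 cube
    pts = {(random.randrange(4), random.randrange(4), random.randrange(4))
           for _ in range(random.randrange(1, 30))}
    A_sh = {(i, k) for i, j, k in pts}    # "A shadow": project out j
    B_sh = {(j, k) for i, j, k in pts}    # "B shadow": project out i
    C_sh = {(i, j) for i, j, k in pts}    # "C shadow": project out k
    # #cubes <= sqrt(area(A shadow) * area(B shadow) * area(C shadow))
    assert len(pts) ** 2 <= len(A_sh) * len(B_sh) * len(C_sh)
```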
Proof of Communication Lower Bound on C = A*B (5/5)
- Consider one segment of instructions with M loads & stores
- There can be at most 2M entries of A, B, C available in one segment
- Volume of the set of cubes representing possible multiply/adds in one segment is ≤ (2M * 2M * 2M)^(1/2) = (2M)^(3/2) = F
- #Segments ≥ 2n^3 / F
- #Loads + #Stores ≥ M * #Segments ≥ M * 2n^3 / F = n^3 / (2M)^(1/2)
- Parallel case: apply the reasoning to one processor out of P
  - #Adds and #Muls ≥ 2n^3 / P (at least one proc does this)
  - M = n^2 / P (each processor gets an equal fraction of the matrix)
  - #Loads + #Stores = #words moved from or to other procs ≥ M * (2n^3/P) / F = M * (2n^3/P) / (2M)^(3/2) = n^2 / (2P)^(1/2)
How to generalize this lower bound (1/4)
- It doesn't depend on C(i,j) being a matrix entry, just a unique memory location (same for A(i,k) and B(k,j))
- It doesn't depend on C and A not overlapping (or C and B, or A and B)
  - Some reorderings may change the answer; we still get a lower bound
- It doesn't depend on doing n^3 multiply/adds
- It doesn't depend on C(i,j) just being a sum of products C(i,j) = Σ_k A(i,k)*B(k,j); the model is:

  (P)   For all (i,j) ∈ S:  Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, ... some other arguments )
How to generalize this lower bound (2/4)
- It does assume the presence of an operand generating a load and/or store; how could this not happen?
  - Mem(b(k,j)) could be reused in many more g_ijk than (P) allows
    - Ex: Compute C^(m) = A * B.^m (Matlab notation) for m = 1 to t
    - Can move t^(1/2) times fewer words than Ω(#flops / M^(1/2))
  - We might generate a result during a segment, use it, and discard it, without generating any memory traffic
    - Turns out QR, eig, SVD all may do this
    - Need a different analysis for them (later)
How to generalize this lower bound (3/4)
- Need to distinguish Sources and Destinations of each operand in fast memory during a segment
- Possible Sources:
  - S1: already in fast memory at start of segment, or read; at most 2M
  - S2: created during segment; no bound without more information
- Possible Destinations:
  - D1: left in fast memory at end of segment, or written; at most 2M
  - D2: discarded; no bound without more information
- Need to assume no S2/D2 arguments; at most 4M of the others
How to generalize this lower bound (4/4)
- Theorem: To evaluate (P) with memory of size M, where
  - f_ij and g_ijk are nontrivial functions of their arguments,
  - G is the total number of g_ijk's, and
  - there are no S2/D2 arguments,
  requires at least G / (8M)^(1/2) - M slow memory references
- Simpler: #words_moved = Ω(#flops / M^(1/2))
- Corollary: To evaluate (P) requires at least G / (8^(1/2) * M^(3/2)) - 1 messages (simpler: Ω(#flops / M^(3/2)))
  - Proof: the maximum message size is M
Some Corollaries of this lower bound (1/2)
#words_moved = Ω(#flops / M^(1/2))
- Theorem applies to dense or sparse, parallel or sequential:
  - MatMul, including A^T*A or A^2
  - Triangular solve C = A^(-1)*X
    - C(i,j) = (X(i,j) - Σ_(k<i) A(i,k)*C(k,j)) / A(i,i), A lower triangular
    - C plays the double role of b and c in Model (P)
  - LU factorization (any pivoting, LU or ILU)
    - L(i,j) = (A(i,j) - Σ_(k<j) L(i,k)*U(k,j)) / U(j,j);  U(i,j) = A(i,j) - Σ_(k<i) L(i,k)*U(k,j)
    - L (and U) play the double role of c and a (c and b) in Model (P)
  - Cholesky (any diagonal pivoting, C or IC)
    - L(i,j) = (A(i,j) - Σ_(k<j) L(i,k)*L^T(k,j)) / L(j,j);  L(j,j) = (A(j,j) - Σ_(k<j) L(j,k)*L^T(k,j))^(1/2)
    - L (and L^T) plays the triple role of a, b and c
Some Corollaries of this lower bound (2/2)
#words_moved = Ω(#flops / M^(1/2))
- Applies to "simplified" operations
  - Ex: Compute the determinant of A(i,j) = func(i,j) using LU, where each func(i,j) is called just once; so there are no inputs and 1 output
  - Still get Ω(#flops / M^(1/2) - 2n^2) words_moved, by imposing reads/writes
- Applies to compositions of operations
  - Ex: Compute A^k by repeated squaring, with only input A and output A^k
  - Still get Ω(#flops / M^(1/2) - n^2 log k), by imposing reads/writes
  - Holds for any interleaving of the operations
- Applies to some graph algorithms
  - Ex: All-Pairs-Shortest-Paths by Floyd-Warshall: g_ijk and f_ij = "min"
  - Get Ω(F / M^(1/2)) words_moved, where F ∈ {n^4, n^3 log n, n^3}
Lower bounds for Orthogonal Transformations (1/4)
- Needed for QR, eig, SVD, ...
- Analyze blocked Householder transformations: Π_(j=1..b) (I - 2 u_j u_j^T) = I - U T U^T, where U = [u_1, ..., u_b]
- Treat Givens rotations as 2x2 Householder transformations
- Model details and assumptions:
  - Write (I - U T U^T) A = A - U (T U^T A) = A - U Z, where Z = T (U^T A)
  - Only count operations in all the A - U Z operations
    - Generically a large fraction of the work
  - Assume "Forward Progress": each successive Householder transformation leaves previously created zeros zero; OK for
    - QR decomposition
    - Reductions to condensed forms (Hessenberg, tri/bidiagonal)
      - Possible exception: bulge chasing in the banded case
    - One sweep of (block) Hessenberg QR iteration
Lower bounds for Orthogonal Transformations (2/4)
- Perform many A - U Z, where Z = T (U^T A)
- First challenge to applying the theorem: need to collect all A - U Z into one big set to which model (P) applies
  - Write them all as A(i,j) = A(i,j) - Σ_k U(i,k)*Z(k,j), where k = index of the k-th transformation
  - k is not necessarily the index of the column of A it comes from
- Second challenge: all the Z(k,j) may be S2/D2
  - Recall: S2/D2 means computed on the fly and discarded
  - Ex: A (n x 2n) = Q (n x n) * R (n x 2n), where 2n^2 = M, so A fits in cache
  - Represent Q as n(n-1)/2 2x2 Householder (Givens) transforms
  - There are n^2(n-1)/2 = Θ(M^(3/2)) nonzero Z(k,j), not O(M)
- Still only do O(M^(3/2)) flops during a segment
  - But can't use Loomis-Whitney to prove it!
  - Need a new idea
Lower bounds for Orthogonal Transformations (3/4)
- Dealing with Z(k,j) being S2/D2
- How to bound the #flops in A(i,j) = A(i,j) - Σ_k U(i,k)*Z(k,j)?
- Neither A nor U is S2/D2:
  - A is either turned into R or U, which are output
  - So at most 2M of each during a segment
- #flops ≤ #(U(i,k)) * #(columns A(:,j) present)
         ≤ #(U(i,k)) * ( #(A(i,j)) / min #nonzeros per column of A )
         ≤ h * O(M) / r,  where h = #(U(i,k)) = O(M)
- How small can r be?
  - To store h U(i,k) Householder vector entries in r rows, there can be at most r in the first column, r-1 in the second, etc. (to maintain Forward Progress), so r(r-1)/2 ≥ h, so r ≥ h^(1/2)
- #flops ≤ h * O(M) / r ≤ O(M) * h^(1/2) = O(M^(3/2)), as desired
Lower bounds for Orthogonal Transformations (4/4)
- Theorem: #words_moved by QR decomposition using (blocked) Householder transformations is Ω(#flops / M^(1/2))
- Theorem: #words_moved by reduction to Hessenberg form, tridiagonal form, bidiagonal form, or one sweep of QR iteration (or block versions of any of these) is Ω(#flops / M^(1/2))
  - Assuming Forward Progress (created zeros remain zero)
  - Model: merge left and right orthogonal transformations:
    A(i,j) = A(i,j) - Σ_(kL) U_L(i,kL)*Z_L(kL,j) - Σ_(kR) Z_R(i,kR)*U_R(j,kR)
Summary of dense sequential Cholesky algorithms attaining communication lower bounds
- Algorithms shown minimizing #Messages assume (recursive) block layout
- Many references (see reports); only some shown, plus ours
  - Oldest reference may or may not include the analysis
- Cache-oblivious are underlined, Green are ours, ? is unknown/future work
- b = block size, M = fast memory size, n = dimension

| Algorithm | 2 Levels of Memory: Words Moved | 2 Levels: Messages | Multiple Levels of Memory: Words Moved | Multiple Levels: Messages |
|---|---|---|---|---|
| Cholesky | LAPACK (with b = M^(1/2)); [Gustavson, 97] (recursive, assumed good MatMul available); [BDHS09] | [BDHS09], like [Gustavson, 97] and [Ahmed, Pingali, 00]; uses recursive Cholesky, TRSM and Matmul | (same); [BDHS09] has the analysis | (same); [BDHS09] has the analysis |
Summary of dense sequential QR algorithms attaining communication lower bounds
- Algorithms shown minimizing #Messages assume (recursive) block layout
- Many references (see reports); only some shown, plus ours
  - Oldest reference may or may not include the analysis
- Cache-oblivious are underlined, Green are ours, ? is unknown/future work
- b = block size, M = fast memory size, n = dimension

| Algorithm | 2 Levels of Memory: Words Moved | 2 Levels: Messages | Multiple Levels of Memory: Words Moved | Multiple Levels: Messages |
|---|---|---|---|---|
| QR | [Elmroth, Gustavson, 98] (recursive, may do more flops); LAPACK (only for n ≥ M with b = M^(1/2), or n^2 ≤ M); [DGHL08] | [Frens, Wise, 03] (explicit Q only, 3x more flops); [DGHL08] (implicit Q, but represented differently, same flop count) | [Elmroth, Gustavson, 98]; [DGHL08]? | [Frens, Wise, 03]; [DGHL08]? Can we get the same flop count? |
Summary of dense sequential LU algorithms attaining communication lower bounds
- Algorithms shown minimizing #Messages assume (recursive) block layout
- Many references (see reports); only some shown, plus ours
  - Oldest reference may or may not include the analysis
- Cache-oblivious are underlined, Green are ours, ? is unknown/future work
- b = block size, M = fast memory size, n = dimension

| Algorithm | 2 Levels of Memory: Words Moved | 2 Levels: Messages | Multiple Levels of Memory: Words Moved | Multiple Levels: Messages |
|---|---|---|---|---|
| LU with pivoting | [Toledo, 97] (recursive, assumes good TRSM and Matmul available); LAPACK (only for n ≥ M with b = M^(1/2), or n^2 ≤ M); [GDX08] (different than partial pivoting, still stable) | [GDX08] | [Toledo, 97]; [GDX08]? | [GDX08]? |
Same idea for LU, almost [GDX08]
- First idea: use the same reduction tree as in QR, with LU at each node
  - Numerically unstable!
- Need a different pivoting scheme:
  - At each node in the tree, TSLU selects b pivot rows from the 2b candidates from its 2 child nodes
  - At each node, do LU on the 2b original rows selected by the child nodes, not on the U factors from the child nodes
  - When TSLU is done, permute the b selected rows to the top of the original matrix, and redo b steps of LU without pivoting
- Thm: Same results as partial pivoting on a different input matrix whose entries have the same magnitudes as the original
- CALU = Communication-Avoiding LU for general A
  - Use TSLU for panel factorizations
  - Apply to the rest of the matrix
  - Cost: redundant panel factorizations
  - Benefit:
    - Stable in practice, but not the same pivot choice as GEPP
    - b times fewer messages overall, hence faster
TSLU: LU factorization of a tall skinny matrix
First try the obvious generalization of TSQR.

Growth factor for TSLU-based factorization
- Unstable for large P and large matrices.
- When P = #rows, TSLU is equivalent to parallel pivoting.
Courtesy of H. Xiang

Growth factor for better CALU approach
- Like threshold pivoting with worst-case threshold = 0.33, so |L| < 3
- Testing shows about the same residual as GEPP
Performance vs ScaLAPACK
- TSLU (tall-skinny)
  - IBM Power 5: up to 4.37x faster (16 procs, 1M x 150)
  - Cray XT4: up to 5.52x faster (8 procs, 1M x 150)
- CALU (square)
  - IBM Power 5: up to 2.29x faster (64 procs, 1000 x 1000)
  - Cray XT4: up to 1.81x faster (64 procs, 1000 x 1000)
- See INRIA Tech Report 6523 (2008), paper at SC08
CALU speedup prediction for a Petascale machine - up to 81x faster
P = 8192
Petascale machine modeled with 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.
Eigenvalue problems and SVD
- Uses spectral divide and conquer
  - Idea goes back to Bulgakov, Godunov, Malyshev
  - Roughly: if A^(2^k) = QR, the leading columns of Q converge to span the invariant subspace for the eigenvalues of A outside the unit circle, so Q*AQ becomes block upper triangular
  - Repeated squaring is done implicitly
- [Bai, D., Gu, 97]
  - Reduced just to Matmul, QR, and Generalized QR with pivoting
  - Eliminated the need for exp(A), other improvements
- [DDH, 07]
  - Introduced randomized rank-revealing QR: [Q,R] = qr(randn(n,n)); [Q,R] = qr(A*Q)
  - Just need Matmul and QR
- [BDD, 09]
  - Shows that randomized GQR is also rank-revealing
  - Shows the limits of divide-and-conquer, depending on the pseudospectrum
  - Cache-oblivious, using [Frens, Wise, 03]

Implicit Repeated Squaring
(Iteration for each j shown as a figure.)
Sparse Linear Algebra
- Thm (George 73, Hoffman/Martin/Rose 73, Eisenstat/Schultz/Sherman 76): Cholesky on an n^s x n^s sparse matrix whose graph is an s-dimensional nearest-neighbor grid does Θ(n^(3(s-1))) flops; attainable by nested dissection ordering
  - Proof: most of the work is dense Cholesky on a final submatrix of size Θ(n^(s-1))
- Corollary [BDHS09, G09]: Sequential Sparse Cholesky on such a matrix moves Ω(n^(3(s-1)) / M^(1/2)) words, and Parallel Sparse Cholesky moves Ω(n^(2(s-1)) / (P log n)^(1/2)) words on P procs
  - Assumes a 2D parallel algorithm, i.e. minimal memory
- Thm [G09]: Attainable in the parallel case by nested dissection, e.g. with PSPASES.
Avoiding Communication in Iterative Linear Algebra
- k steps of a typical iterative solver for sparse Ax=b or Ax=λx
  - Does k SpMVs with the starting vector (e.g. with b, if solving Ax=b)
  - Finds "best" solution among all linear combinations of these k+1 vectors
  - Many such Krylov Subspace Methods: Conjugate Gradients, GMRES, Lanczos, Arnoldi, ...
- Goal: minimize communication in Krylov Subspace Methods
  - Assume the matrix is "well-partitioned", with modest surface-to-volume ratio
  - Parallel implementation
    - Conventional: O(k log p) messages, because of k calls to SpMV
    - New: O(log p) messages (optimal)
  - Serial implementation
    - Conventional: O(k) moves of data from slow to fast memory
    - New: O(1) moves of data (optimal)
- Lots of speedup possible (modeled and measured)
  - Price: some redundant computation
Locally Dependent Entries for [x, Ax], [x, Ax, A^2 x], ..., [x, Ax, ..., A^8 x]; A tridiagonal, 2 processors
(Sequence of figures, one per power, showing Proc 1 and Proc 2.) The locally dependent entries can be computed without communication; for k = 8 this gives k-fold reuse of A.
Remotely Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors
One message per processor suffices to get the data needed to compute the remotely dependent entries, not k = 8 messages.
Minimizes the number of messages = latency cost.
Price: redundant work proportional to the surface/volume ratio. (A serial simulation of this idea is sketched below.)
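A serial NumPy simulation of that idea for a hypothetical 16 x 16 tridiagonal A split between two "processors": each one grabs the k ghost entries of x it will eventually need in a single up-front fetch, then computes its entries of [x, Ax, ..., A^k x] with no further communication, at the price of some redundant boundary work.

```python
import numpy as np

np.random.seed(0)
n, k = 16, 4
# hypothetical tridiagonal test matrix (2 on the diagonal, -1 off-diagonal)
A = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
x = np.random.rand(n)
half = n // 2

def local_powers(lo, hi):
    """Entries lo..hi-1 of [x, Ax, ..., A^k x], after one up-front fetch of ghost data."""
    glo, ghi = max(lo - k, 0), min(hi + k, n)    # owned range extended by k ghost entries
    V = np.zeros((k + 1, ghi - glo))
    V[0] = x[glo:ghi]                            # the single "message" of remote data
    # Interior ghost edges cannot be updated (their neighbors are missing), so the
    # correctly computed range shrinks by one entry per step; by construction it
    # still covers the owned entries after k steps.
    rlo = glo if glo == 0 else glo + 1
    rhi = ghi if ghi == n else ghi - 1
    for j in range(1, k + 1):
        V[j, rlo - glo:rhi - glo] = A[rlo:rhi, glo:ghi] @ V[j - 1]
    return V[:, lo - glo:hi - glo]               # keep only the owned entries

W = np.hstack([local_powers(0, half), local_powers(half, n)])
W_ref = np.vstack([np.linalg.matrix_power(A, j) @ x for j in range(k + 1)])
assert np.allclose(W, W_ref)                     # same result as k unblocked SpMVs
```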
Remotely Dependent Entries for [x, Ax, A^2 x, A^3 x], A irregular, multiple processors

Sequential [x, Ax, ..., A^4 x], with memory hierarchy
One read of the matrix from slow memory, not k = 4 reads.
Minimizes #words moved = bandwidth cost.
No redundant work.
Performance results on 8-core Clovertown