CS 267 Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication presentation

About This Presentation

Transcript and Presenter's Notes

Title: CS 267 Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication

1
CS 267Dense Linear AlgebraHistory and
Structure,Parallel Matrix Multiplication

James Demmel
www.cs.berkeley.edu/demmel/cs267_Spr09

2
Outline

History and motivation
Structure of the Dense Linear Algebra motif
Parallel Matrix-matrix multiplication
Parallel Gaussian Elimination (next time)

3
Outline

History and motivation
Structure of the Dense Linear Algebra motif
Parallel Matrix-matrix multiplication
Parallel Gaussian Elimination (next time)

4
Motifs
The Motifs (formerly Dwarfs) from The
Berkeley View (Asanovic et al.) Motifs form key
computational patterns
5
A brief history of (Dense) Linear Algebra
software (1/6)

In the beginning was the do-loop

Libraries like EISPACK (for eigenvalue problems)
Then the BLAS (1) were invented (1973-1977)
Standard library of 15 operations (mostly) on
vectors
AXPY ( y ax y ), dot product, scale (x
ax ), etc
Up to 4 versions of each (S/D/C/Z), 46 routines,
3300 LOC
Goals
Common pattern to ease programming,
readability, self-documentation
Robustness, via careful coding (avoiding
over/underflow)
Portability Efficiency via machine specific
implementations
Why BLAS 1 ? They do O(n1) ops on O(n1) data
Used in libraries like LINPACK (for linear
systems)
Source of the name LINPACK Benchmark (not the
code!)

6
Current Records for Solving Dense Systems
(11/2008)
www.netlib.org, click on Performance Database
Server

Gigaflops Machine n100
n1000 Any n Peak Roadrunner (LANL)
1.1M
1.5M (1.1 Petaflops, 130K cores, 2.5 MW,
n2.3M)

NEC SX 8
(fastest in 2005) (8 proc, 2 GHz)
75.1
128 (1 proc, 2 GHz) 2.2
15.0 16
Palm Pilot III .00000169
(1.69 Kiloflops)
7
A brief history of (Dense) Linear Algebra
software (2/6)

But the BLAS-1 werent enough
Consider AXPY ( y ax y ) 2n flops on 3n
read/writes
Computational intensity (2n)/(3n) 2/3
Too low to run near peak speed (read/write
dominates)
Hard to vectorize (SIMDize) on supercomputers
of the day (1980s)
So the BLAS-2 were invented (1984-1986)
Standard library of 25 operations (mostly) on
matrix/vector pairs
GEMV y aAx ßx, GER A A axyT,
x T-1x
Up to 4 versions of each (S/D/C/Z), 66 routines,
18K LOC
Why BLAS 2 ? They do O(n2) ops on O(n2) data
So computational intensity still just (2n2)/(n2)
2
OK for vector machines, but not for machine with
caches

8
A brief history of (Dense) Linear Algebra
software (3/6)

The next step BLAS-3 (1987-1988)
Standard library of 9 operations (mostly) on
matrix/matrix pairs
GEMM C aAB ßC, C aAAT ßC, B
T-1B
Up to 4 versions of each (S/D/C/Z), 30 routines,
10K LOC
Why BLAS 3 ? They do O(n3) ops on O(n2) data
So computational intensity (2n3)/(4n2) n/2
big at last!
Good for machines with caches, other mem.
hierarchy levels
How much BLAS1/2/3 code so far (all at
www.netlib.org/blas)
Source 142 routines, 31K LOC, Testing 28K
LOC
Reference (unoptimized) implementation only
Ex 3 nested loops for GEMM
Lots more optimized code (eg Homework 1)
Motivates automatic tuning of the BLAS (details
later)

9
(No Transcript)
10
A brief history of (Dense) Linear Algebra
software (4/6)

LAPACK Linear Algebra PACKage - uses BLAS-3
(1989 now)
Ex Obvious way to express Gaussian Elimination
(GE) is adding multiples of one row to other rows
BLAS-1
How do we reorganize GE to use BLAS-3 ? (details
later)
Contents of LAPACK (summary)
Algorithms we can turn into (nearly) 100 BLAS 3
Linear Systems solve Axb for x
Least Squares choose x to minimize r2 ? ?S
ri2 where rAx-b
Algorithms we can only make up to 50 BLAS 3 (so
far)
Eigenproblems Find l and x where Ax l x
Singular Value Decomposition (SVD) ATAx?2x
Error bounds for everything
Lots of variants depending on As structure
(banded, AAT, etc)
How much code? (Release 3.2, Nov 2008)
(www.netlib.org/lapack)
Source 1582 routines, 490K LOC, Testing 352K
LOC
Ongoing development (at UCB and elsewhere) (class
projects!)

11
A brief history of (Dense) Linear Algebra
software (5/6)

Is LAPACK parallel?
Only if the BLAS are parallel (possible in shared
memory)
ScaLAPACK Scalable LAPACK (1995 now)
For distributed memory uses MPI
More complex data structures, algorithms than
LAPACK
Only (small) subset of LAPACKs functionality
available
Details later (class projects!)
All at www.netlib.org/scalapack

12
Success Stories for Sca/LAPACK

Widely used
Adopted by Mathworks, Cray, Fujitsu, HP, IBM,
IMSL, Intel, NAG, NEC, SGI,
gt96M web hits(in 2009, 56M in 2006) _at_ Netlib
(incl. CLAPACK, LAPACK95)
New Science discovered through the solution of
dense matrix systems
Nature article on the flat universe used
ScaLAPACK
Other articles in Physics Review B that also use
it
1998 Gordon Bell Prize
www.nersc.gov/news/reports/newNERSCresults050703.p
df

Cosmic Microwave Background Analysis, BOOMERanG
collaboration, MADCAP code (Apr. 27, 2000).
ScaLAPACK
13
A brief history of (Dense) Linear Algebra
software (6/6)

PLASMA and MAGMA (now)
Planned extensions to Multicore/GPU/Heterogeneous
Can one software infrastructure accommodate all
algorithms and platforms of current (future)
interest?
How much code generation and tuning can we
automate?
Details later (Class projects!)
Other related projects
BLAST Forum (www.netlib.org/blas/blast-forum)
Attempt to extend BLAS to other languages, add
some new functions, sparse matrices,
extra-precision, interval arithmetic
Only partly successful (extra-precise BLAS used
in latest LAPACK)
FLAME (www.cs.utexas.edu/users/flame/)
Formal Linear Algebra Method Environment
Attempt to automate code generation across
multiple platforms

14
Outline

History and motivation
Structure of the Dense Linear Algebra motif
Parallel Matrix-matrix multiplication
Parallel Gaussian Elimination (next time)

15
What could go into the linear algebra motif(s)?
For all linear algebra problems
For all matrix/problem structures
For all data types
For all architectures and networks
For all programming interfaces
Produce best algorithm(s) w.r.t.
performance and accuracy (including condition
estimates, etc)
Need to prioritize, automate!
16
For all linear algebra problemsEx LAPACK Table
of Contents

Linear Systems
Least Squares
Overdetermined, underdetermined
Unconstrained, constrained, weighted
Eigenvalues and vectors of Symmetric Matrices
Standard (Ax ?x), Generalized (Ax?xB)
Eigenvalues and vectors of Unsymmetric matrices
Eigenvalues, Schur form, eigenvectors, invariant
subspaces
Standard, Generalized
Singular Values and vectors (SVD)
Standard, Generalized
Level of detail
Simple Driver
Expert Drivers with error bounds,
extra-precision, other options
Lower level routines (apply certain kind of
orthogonal transformation)

17
For all matrix/problem structuresEx LAPACK
Table of Contents

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

BD bidiagonal
GB general banded
GE general
GG general , pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

18
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general , pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

19
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

20
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

21
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

22
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general , pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

23
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

24
Organizing Linear Algebra in books
www.netlib.org/lapack
www.netlib.org/scalapack
gams.nist.gov
www.cs.utk.edu/dongarra/etemplates
www.netlib.org/templates
25
Outline

History and motivation
Structure of the Dense Linear Algebra motif
Parallel Matrix-matrix multiplication
Parallel Gaussian Elimination (next time)

26
Different Parallel Data Layouts for Matrices (not
all!)
1) 1D Column Blocked Layout
2) 1D Column Cyclic Layout
4) Row versions of the previous layouts
b
3) 1D Column Block Cyclic Layout
Generalizes others
6) 2D Row and Column Block Cyclic Layout
5) 2D Row and Column Blocked Layout
27
Parallel Matrix-Vector Product

Compute y y Ax, where A is a dense matrix
Layout
1D row blocked
A(i) refers to the n by n/p block row
that
processor i owns,
x(i) and y(i) similarly refer to
segments
of x,y owned by i
Algorithm
Foreach processor i
Broadcast x(i)
Compute y(i) A(i)x
Algorithm uses the formula
y(i) y(i) A(i)x y(i) Sj A(i,j)x(j)

P0 P1 P2 P3
x
P0 P1 P2 P3
A(0)
y
A(1)
A(2)
A(3)
28
Matrix-Vector Product y y Ax

A column layout of the matrix eliminates the
broadcast of x
But adds a reduction to update the destination y
A 2D blocked layout uses a broadcast and
reduction, both on a subset of processors
sqrt(p) for square processor grid

P0 P1 P2 P3
P0 P1 P2 P3
P4 P5 P6 P7
P8 P9 P10 P11
P12 P13 P14 P15
29
Parallel Matrix Multiply

Computing CCAB
Using basic algorithm 2n3 Flops
Variables are
Data layout
Topology of machine
Scheduling communication
Use of performance models for algorithm design
Message Time latency words time-per-word
a nb
Efficiency (in any model)
serial time / (p parallel time)
perfect (linear) speedup ? efficiency 1

30
Matrix Multiply with 1D Column Layout

Assume matrices are n x n and n is divisible by p
A(i) refers to the n by n/p block column that
processor i owns (similiarly for B(i) and C(i))
B(i,j) is the n/p by n/p sublock of B(i)
in rows jn/p through (j1)n/p
Algorithm uses the formula
C(i) C(i) AB(i) C(i) Sj A(j)B(j,i)

May be a reasonable assumption for analysis, not
for code
31
Matrix Multiply 1D Layout on Bus or Ring

Algorithm uses the formula
C(i) C(i) AB(i) C(i) Sj A(j)B(j,i)
First consider a bus-connected machine without
broadcast only one pair of processors can
communicate at a time (ethernet)
Second consider a machine with processors on a
ring all processors may communicate with nearest
neighbors simultaneously

32
MatMul 1D layout on Bus without Broadcast

Naïve algorithm
C(myproc) C(myproc) A(myproc)B(myproc,myp
roc)
for i 0 to p-1
for j 0 to p-1 except i
if (myproc i) send A(i) to
processor j
if (myproc j)
receive A(i) from processor i
C(myproc) C(myproc)
A(i)B(i,myproc)
barrier
Cost of inner loop
computation 2n(n/p)2 2n3/p2
communication a bn2 /p

33
Naïve MatMul (continued)

Cost of inner loop
computation 2n(n/p)2 2n3/p2
communication a bn2 /p
approximately
Only 1 pair of processors (i and j) are active on
any iteration,
and of those, only i is doing computation
gt the algorithm is almost
entirely serial
Running time
(p(p-1) 1)computation
p(p-1)communication
? 2n3 p2a pn2b
This is worse than the serial time and grows
with p.
When might you still want to do this?

34
Matmul for 1D layout on a Processor Ring

Pairs of adjacent processors can communicate
simultaneously

Copy A(myproc) into Tmp C(myproc) C(myproc)
TmpB(myproc , myproc) for j 1 to p-1
Send Tmp to processor myproc1 mod p
Receive Tmp from processor myproc-1 mod p
C(myproc) C(myproc) TmpB( myproc-j mod p ,
myproc)

Same idea as for gravity in simple sharks and
fish algorithm
May want double buffering in practice for overlap
Ignoring deadlock details in code
Time of inner loop 2(a bn2/p) 2n(n/p)2

35
Matmul for 1D layout on a Processor Ring

Time of inner loop 2(a bn2/p) 2n(n/p)2
Total Time 2n (n/p)2 (p-1) Time of
inner loop
? 2n3/p 2pa 2bn2
(Nearly) Optimal for 1D layout on Ring or Bus,
even with Broadcast
Perfect speedup for arithmetic
A(myproc) must move to each other processor,
costs at least
(p-1)cost of sending n(n/p)
words
Parallel Efficiency 2n3 / (p Total Time)
1/(1 a
p2/(2n3) b p/(2n) )
1/ (1 O(p/n))
Grows to 1 as n/p increases (or a and b shrink)

36
MatMul with 2D Layout

Consider processors in 2D grid (physical or
logical)
Processors can communicate with 4 nearest
neighbors
Broadcast along rows and columns
Assume p processors form square s x s grid, s
p1/2

p(0,0) p(0,1) p(0,2)
p(0,0) p(0,1) p(0,2)
p(0,0) p(0,1) p(0,2)

p(1,0) p(1,1) p(1,2)
p(1,0) p(1,1) p(1,2)
p(1,0) p(1,1) p(1,2)
p(2,0) p(2,1) p(2,2)
p(2,0) p(2,1) p(2,2)
p(2,0) p(2,1) p(2,2)
37
Cannons Algorithm

C(i,j) C(i,j) S A(i,k)B(k,j)
assume s sqrt(p) is an integer
forall i0 to s-1 skew A
left-circular-shift row i of A by i
so that A(i,j) overwritten by
A(i,(ji)mod s)
forall i0 to s-1 skew B
up-circular-shift column i of B by i
so that B(i,j) overwritten by
B((ij)mod s), j)
for k0 to s-1 sequential
forall i0 to s-1 and j0 to s-1
all processors in parallel
C(i,j) C(i,j) A(i,j)B(i,j)
left-circular-shift each row of A
by 1
up-circular-shift each column of B
by 1

k
38
Cannons Matrix Multiplication
C(1,2) A(1,0) B(0,2) A(1,1) B(1,2)
A(1,2) B(2,2)
39
Initial Step to Skew Matrices in Cannon

Initial blocked input
After skewing before initial block multiplies

B(0,1)
B(0,2)
B(0,0)
A(0,1)
A(0,2)
A(0,0)
B(1,0)
B(1,1)
B(1,2)
A(1,0)
A(1,1)
A(1,2)
B(2,0)
B(2,1)
B(2,2)
A(2,0)
A(2,1)
A(2,2)
A(0,1)
A(0,2)
A(0,0)
B(1,1)
B(2,2)
B(0,0)
A(1,0)
A(1,1)
A(1,2)
B(0,2)
B(1,0)
B(2,1)
A(2,0)
A(2,1)
A(2,2)
B(0,1)
B(2,0)
B(1,2)
40
Skewing Steps in CannonAll blocks of A must
multiply all like-colored blocks of B

First step
Second
Third

A(0,1)
A(0,2)
B(0,2)
B(1,0)
B(2,1)
A(0,0)
A(1,0)
A(1,2)
B(0,1)
B(2,0)
B(1,2)
A(1,1)
A(2,0)
A(2,1)
B(1,1)
B(2,2)
B(0,0)
A(2,2)
A(0,1)
A(0,2)
A(0,0)
B(0,1)
B(2,0)
B(1,2)
A(1,0)
A(1,1)
A(1,2)
B(1,1)
B(2,2)
B(0,0)
A(2,0)
A(2,1)
A(2,2)
B(0,2)
B(1,0)
B(2,1)
41
Cost of Cannons Algorithm

forall i0 to s-1 recall s
sqrt(p)
left-circular-shift row i of A by i
cost s(a bn2/p)
forall i0 to s-1
up-circular-shift column i of B by i
cost s(a bn2/p)
for k0 to s-1
forall i0 to s-1 and j0 to s-1
C(i,j) C(i,j) A(i,j)B(i,j)
cost 2(n/s)3 2n3/p3/2
left-circular-shift each row of A
by 1 cost a bn2/p
up-circular-shift each column of B
by 1 cost a bn2/p

Total Time 2n3/p 4 s? 4?n2/s
Parallel Efficiency 2n3 / (p Total Time)
1/( 1 a
2(s/n)3 b 2(s/n) )
1/(1
O(sqrt(p)/n))
Grows to 1 as n/s n/sqrt(p) sqrt(data per
processor) grows
Better than 1D layout, which had Efficiency
1/(1 O(p/n))

42
Cannons Algorithm is optimal

Optimal means
Considering only O(n3) matmul algs (not Strassen)
Considering only O(n2/p) storage per processor
Use Irony/Toledo/Tiskin result from Lecture 2
A processor doing n3 flops from matmul with a
fast memory of size M must do ? (n3 / M1/2 )
references to slow memory
Same proof a processor doing f flops from matmul
with a local memory of size M must do ? (f / M1/2
) references to remote memory
Local memory O(n2/p) , f ? n3/p for at least
one proc.
So f / M1/2 ? ((n3/p)/ (n2/p)1/2 ) ? (
n2/p1/2 )
Messages ? ( words sent / max message size)
? ( (n2/p1/2)/(n2/p)) ? (p1/2)

43
Pros and Cons of Cannon

So what if its optimal, is it fast?
Local computation one call to (optimized)
matrix-multiply
Hard to generalize for
p not a perfect square
A and B not square
Dimensions of A, B not perfectly divisible by
ssqrt(p)
A and B not aligned in the way they are stored
on processors
block-cyclic layouts
Memory hog (extra copies of local matrices)

44
SUMMA Algorithm

SUMMA Scalable Universal Matrix Multiply
Slightly less efficient, but simpler and easier
to generalize
Presentation from van de Geijn and Watts
www.netlib.org/lapack/lawns/lawn96.ps
Similar ideas appeared many times
Used in practice in PBLAS Parallel BLAS
www.netlib.org/lapack/lawns/lawn100.ps

45
SUMMA
B(k,j)
j
k
k

C(i,j)
i
A(i,k)

i, j represent all rows, columns owned by a
processor
k is a block of b ? 1 rows or columns
C(i,j) C(i,j) Sk A(i,k)B(k,j)
Assume a pr by pc processor grid (pr pc 4
above)
Need not be square

46
SUMMA
B(k,j)
j
k
k

C(i,j)
i
A(i,k)
For k0 to n-1 or n/b-1 where b is the
block size
cols in A(i,k) and rows in B(k,j) for all
i 1 to pr in parallel owner of
A(i,k) broadcasts it to whole processor row
for all j 1 to pc in parallel
owner of B(k,j) broadcasts it to whole processor
column Receive A(i,k) into Acol Receive
B(k,j) into Brow C_myproc C_myproc
Acol Brow
47
SUMMA performance

To simplify analysis only, assume s sqrt(p)

For k0 to n/b-1 for all i 1 to s s
sqrt(p) owner of A(i,k) broadcasts it
to whole processor row time
log s ( a b bn/s), using a tree for
all j 1 to s owner of B(k,j)
broadcasts it to whole processor column
time log s ( a b bn/s), using a
tree Receive A(i,k) into Acol Receive
B(k,j) into Brow C_myproc C_myproc Acol
Brow time 2(n/s)2b

Total time 2n3/p a log p n/b b
log p n2 /s

48
SUMMA performance

Total time 2n3/p a log p n/b
b log p n2 /s
Parallel Efficiency
1/(1 a log p p / (2bn2) b log
p s/(2n) )
?Same b term as Cannon, except for log p factor
log p grows slowly so this is ok
Latency (a) term can be larger, depending on b
When b1, get a log p n
As b grows to n/s, term shrinks to
a log p s (log p times
Cannon)
Temporary storage grows like 2bn/s
Can change b to tradeoff latency cost with memory

49
PDGEMM PBLAS routine for matrix
multiply Observations For fixed N, as P
increases Mflops increases, but
less than 100 efficiency For fixed P, as N
increases, Mflops (efficiency) rises
DGEMM BLAS routine for matrix
multiply Maximum speed for PDGEMM Procs
speed of DGEMM Observations (same as above)
Efficiency always at least 48 For fixed
N, as P increases, efficiency drops
For fixed P, as N increases, efficiency
increases
50
Summary of Parallel Matrix Multiplication

1D Layout
Bus without broadcast - slower than serial
Nearest neighbor communication on a ring (or bus
with broadcast) Efficiency 1/(1 O(p/n))
2D Layout
Cannon
Efficiency 1/(1O(a ( sqrt(p) /n)3 b
sqrt(p) /n)) optimal!
Hard to generalize for general p, n, block
cyclic, alignment
SUMMA
Efficiency 1/(1 O(a log p p / (bn2)
blog p sqrt(p) /n))
Very General
b small gt less memory, lower efficiency
b large gt more memory, high efficiency
Used in practice (PBLAS)

51
ScaLAPACK Parallel Library
52
Extra Slides
53
Recursive Layouts

For both cache hierarchies and parallelism,
recursive layouts may be useful
Z-Morton, U-Morton, and X-Morton Layout
Also Hilbert layout and others
What about the users view?
Fortunately, many problems can be solved on a
permutation
Never need to actually change the users layout

54
Gaussian Elimination
x
0
x
. . .
x
x
Standard Way subtract a multiple of a row
Slide source Dongarra
55
Gaussian Elimination via a Recursive Algorithm
F. Gustavson and S. Toledo
LU Algorithm 1 Split matrix into two
rectangles (m x n/2) if only 1 column,
scale by reciprocal of pivot return 2
Apply LU Algorithm to the left part 3 Apply
transformations to right part
(triangular solve A12 L-1A12 and
matrix multiplication A22A22 -A21A12
) 4 Apply LU Algorithm to right part
Most of the work in the matrix multiply Matrices
of size n/2, n/4, n/8,
Slide source Dongarra
56
Recursive Factorizations

Just as accurate as conventional method
Same number of operations
Automatic variable blocking
Level 1 and 3 BLAS only !
Extreme clarity and simplicity of expression
Highly efficient
The recursive formulation is just a rearrangement
of the point-wise LINPACK algorithm
The standard error analysis applies (assuming the
matrix operations are computed the conventional
way).

Slide source Dongarra
57

Recursive LU

Dual-processor
LAPACK
Recursive LU
LAPACK
Uniprocessor
Slide source Dongarra
58
Review BLAS 3 (Blocked) GEPP
for ib 1 to n-1 step b Process matrix b
columns at a time end ib b-1
Point to end of block of b columns
apply BLAS2 version of GEPP to get A(ibn ,
ibend) P L U let LL denote the
strict lower triangular part of A(ibend ,
ibend) I A(ibend , end1n) LL-1
A(ibend , end1n) update next b rows
of U A(end1n , end1n ) A(end1n ,
end1n ) - A(end1n , ibend)
A(ibend , end1n)
apply delayed updates with
single matrix-multiply
with inner dimension b
BLAS 3
59
Review Row and Column Block Cyclic Layout
processors and matrix blocks are distributed in a
2d array pcol-fold parallelism in any column,
and calls to the BLAS2 and BLAS3 on matrices of
size brow-by-bcol serial bottleneck is
eased need not be symmetric in rows and columns
60
Distributed GE with a 2D Block Cyclic Layout
block size b in the algorithm and the block sizes
brow and bcol in the layout satisfy bbrowbcol.
shaded regions indicate busy processors or
communication performed. unnecessary to have a
barrier between each step of the algorithm,
e.g.. step 9, 10, and 11 can be pipelined
61
Distributed GE with a 2D Block Cyclic Layout
62
Matrix multiply of green green - blue pink
63
PDGESV ScaLAPACK parallel LU
routine Since it can run no faster than its
inner loop (PDGEMM), we measure Efficiency
Speed(PDGESV)/Speed(PDGEMM) Observations
Efficiency well above 50 for large
enough problems For fixed N, as P
increases, efficiency decreases
(just as for PDGEMM) For fixed P, as N
increases efficiency increases
(just as for PDGEMM) From bottom table, cost
of solving Axb about half of matrix
multiply for large enough matrices.
From the flop counts we would
expect it to be (2n3)/(2/3n3) 3
times faster, but communication makes
it a little slower.
64
(No Transcript)
65
Scales well, nearly full machine speed
66
Old version, pre 1998 Gordon Bell Prize Still
have ideas to accelerate Project Available!
Old Algorithm, plan to abandon
67
Have good ideas to speedup Project available!
Hardest of all to parallelize Have alternative,
and would like to compare Project available!
68
Out-of-core means matrix lives on disk too
big for main mem Much harder to hide latency
of disk QR much easier than LU because no
pivoting needed for QR Moral use QR to solve
Axb Projects available (perhaps very hard)
69
A small software project ...
70
Work-Depth Model of Parallelism

The work depth model
The simplest model is used
For algorithm design, independent of a machine
The work, W, is the total number of operations
The depth, D, is the longest chain of
dependencies
The parallelism, P, is defined as W/D
Specific examples include
circuit model, each input defines a graph with
ops at nodes
vector model, each step is an operation on a
vector of elements
language model, where set of operations defined
by language

71
Latency Bandwidth Model

Network of fixed number P of processors
fully connected
each with local memory
Latency (a)
accounts for varying performance with number of
messages
gap (g) in logP model may be more accurate cost
if messages are pipelined
Inverse bandwidth (b)
accounts for performance varying with volume of
data
Efficiency (in any model)
serial time / (p parallel time)
perfect (linear) speedup ? efficiency 1

72
Initial Step to Skew Matrices in Cannon

Initial blocked input
After skewing before initial block multiplies

B(0,1)
B(0,2)
B(0,0)
A(0,1)
A(0,2)
A(0,0)
B(1,0)
B(1,1)
B(1,2)
A(1,0)
A(1,1)
A(1,2)
B(2,0)
B(2,1)
B(2,2)
A(2,0)
A(2,1)
A(2,2)
A(0,1)
A(0,2)
A(0,0)
B(1,1)
B(2,2)
B(0,0)
A(1,0)
A(1,1)
A(1,2)
B(0,2)
B(1,0)
B(2,1)
A(2,0)
A(2,1)
A(2,2)
B(0,1)
B(2,0)
B(1,2)
73
Skewing Steps in Cannon
A(0,1)
A(0,2)
A(0,0)
B(1,1)
B(2,2)
B(0,0)

First step
Second
Third

A(1,0)
A(1,1)
A(1,2)
B(0,2)
B(1,0)
B(2,1)
A(2,0)
A(2,1)
A(2,2)
B(0,1)
B(2,0)
B(1,2)
A(0,1)
A(0,2)
B(0,2)
B(1,0)
B(2,1)
A(0,0)
A(1,0)
A(1,2)
B(0,1)
B(2,0)
B(1,2)
A(1,1)
A(2,0)
A(2,1)
B(1,1)
B(2,2)
B(0,0)
A(2,2)
A(0,1)
A(0,2)
A(0,0)
B(0,1)
B(2,0)
B(1,2)
A(1,0)
A(1,1)
A(1,2)
B(1,1)
B(2,2)
B(0,0)
A(2,0)
A(2,1)
A(2,2)
B(0,2)
B(1,0)
B(2,1)
74
Motivation (1)

3 Basic Linear Algebra Problems
Linear Equations Solve Axb for x
Least Squares Find x that minimizes r2 ? ?S
ri2 where rAx-b
Statistics Fitting data with simple functions
3a. Eigenvalues Find l and x where Ax l x
Vibration analysis, e.g., earthquakes, circuits
3b. Singular Value Decomposition ATAx?2x
Data fitting, Information retrieval
Lots of variations depending on structure of A
A symmetric, positive definite, banded,

75
Motivation (2)

Why dense A, as opposed to sparse A?
Many large matrices are sparse, but
Dense algorithms easier to understand
Some applications yields large dense matrices
LINPACK Benchmark (www.top500.org)
How fast is your computer?
How fast can you solve
dense Axb?
Large sparse matrix algorithms often yield
smaller (but still large) dense problems
Do ParLab Apps most use small dense matrices?

76
Algorithms for 2D (3D) Poisson Equation (N n2
(n3) vars)

Algorithm Serial PRAM Memory Procs
Dense LU N3 N N2 N2
Band LU N2 (N7/3) N N3/2 (N5/3) N (N4/3)
Jacobi N2 (N5/3) N (N2/3) N N
Explicit Inv. N2 log N N2 N2
Conj.Gradients N3/2 (N4/3) N1/2(1/3) log
N N N
Red/Black SOR N3/2 (N4/3) N1/2 (N1/3) N N
Sparse LU N3/2 (N2) N1/2 Nlog N (N4/3) N
FFT Nlog N log N N N
Multigrid N log2 N N N
Lower bound N log N N
PRAM is an idealized parallel model with zero
cost communication
Reference James Demmel, Applied Numerical
Linear Algebra, SIAM, 1997
(Note corrected complexities for 3D case from
last lecture!).

77
Lessons and Questions (1)

Structure of the problem matters
Cost of solution can vary dramatically (n3 to n)
Many other examples
Some structure can be figured out automatically
A\b can figure out symmetry, some sparsity
Some structures known only to (smart) user
If performance not critical, user may be happy to
settle for A\b
How much of this goes into the motifs?
How much should we try to help user choose?
Tuning, but at algorithmic choice level (SALSA)
Motifs overlap
Dense, sparse, (un)structured grids, spectral

78
Organizing Linear Algebra (1)

By Operations
Low level (eg mat-mul BLAS)
Standard level (eg solve Axb, Ax?x
Sca/LAPACK)
Applications level (eg systems control
SLICOT)
By Performance/accuracy tradeoffs
Direct methods with guarantees vs iterative
methods that may work faster and accurately
enough
By Structure
Storage
Dense
columnwise, rowwise, 2D block cyclic, recursive
space-filling curves
Banded, sparse (many flavors), black-box,
Mathematical
Symmetries, positive definiteness, conditioning,
As diverse as the world being modeled

79
Organizing Linear Algebra (2)

By Data Type
Real vs Complex
Floating point (fixed or varying length), other
By Target Platform
Serial, manycore, GPU, distributed memory,
out-of-DRAM, Grid,
By programming interface
Language bindings
A\b versus access to details

80
For all linear algebra problemsEx LAPACK Table
of Contents

Linear Systems
Least Squares
Overdetermined, underdetermined
Unconstrained, constrained, weighted
Eigenvalues and vectors of Symmetric Matrices
Standard (Ax ?x), Generallzed (Ax?xB)
Eigenvalues and vectors of Unsymmetric matrices
Eigenvalues, Schur form, eigenvectors, invariant
subspaces
Standard, Generalized
Singular Values and vectors (SVD)
Standard, Generalized
Level of detail
Simple Driver
Expert Drivers with error bounds,
extra-precision, other options
Lower level routines (apply certain kind of
orthogonal transformation)

81
For all matrix/problem structuresEx LAPACK
Table of Contents

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

BD bidiagonal
GB general banded
GE general
GG general , pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

82
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general , pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

83
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

84
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

85
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

86
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general , pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

87
For all matrix/problem structuresEx LAPACK
Table of Contents

BD bidiagonal
GB general banded
GE general
GG general, pair
GT tridiagonal
HB Hermitian banded
HE Hermitian
HG upper Hessenberg, pair
HP Hermitian, packed
HS upper Hessenberg
OR (real) orthogonal
OP (real) orthogonal, packed
PB positive definite, banded
PO positive definite
PP positive definite, packed
PT positive definite, tridiagonal

SB symmetric, banded
SP symmetric, packed
ST symmetric, tridiagonal
SY symmetric
TB triangular, banded
TG triangular, pair
TB triangular, banded
TP triangular, packed
TR triangular
TZ trapezoidal
UN unitary
UP unitary packed

88
For all data typesEx LAPACK Table of Contents

Real and complex
Single and double precision
Arbitrary precision in progress

89
Organizing Linear Algebra (3)
www.netlib.org/lapack
www.netlib.org/scalapack
gams.nist.gov
www.cs.utk.edu/dongarra/etemplates
www.netlib.org/templates
90
Review of the BLAS

Building blocks for all linear algebra
Parallel versions call serial versions on each
processor
So they must be fast!
Define q flops / mem refs computational
intensity
The larger is q, the faster the algorithm can go
in the
presence of memory hierarchy
axpy y ax y, where a scalar, x and y
vectors

Write a Comment

User Comments (0)

About PowerShow.com

CS 267 Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication PowerPoint PPT Presentation