Title: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply
Slide 1: CS 267 Applications of Parallel Processors, Lecture 13: Parallel Matrix Multiply
- Kathy Yelick
- http://www.cs.berkeley.edu/~dmartin/cs267
Slide 2: Outline
- Recap
- Sources of large dense linear systems
- BLAS for linear algebra
- Parallel matrix multiply
Slide 3: Model overview
- Work-depth
- PRAM
- Latency/bandwidth model
  - α is the one-time cost per message (latency)
  - β is the per-byte cost of communication
  - we use this model today (a minimal sketch follows this list)
- LogP model
  - correction: the gap should be greater than the overhead
  - more on this with parallel sorting
- Topology-specific models
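The α/β model is the one behind all the cost estimates later in the lecture. As a minimal sketch (the function name and parameters are illustrative, not from the slides), the modeled time to send one message of nbytes bytes is:

    /* Latency/bandwidth model used in this lecture: one message of nbytes
     * bytes is modeled as costing alpha + beta*nbytes time units.
     * alpha and beta are whatever measured constants you plug in. */
    double message_time(double alpha /* per-message cost */,
                        double beta  /* per-byte cost    */,
                        double nbytes)
    {
        return alpha + beta * nbytes;
    }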
Slide 4: Dense Linear Algebra in Electromagnetics
Slide 5: Computational Electromagnetics
- Developed during the 1980s, driven by defense applications
- Determine the RCS (radar cross section) of an airplane
- Reduce the signature of a plane (stealth technology)
- Other applications are antenna design and medical equipment
- Two fundamental numerical approaches:
  - MOM, the method of moments (frequency domain), and
  - finite differences (time domain)
Slide 6: Computational Electromagnetics
- Discretize the surface into triangular facets using standard modeling tools
- The amplitudes of the currents on the surface are the unknowns
- The integral equation is discretized into a set of linear equations
(image: Northwestern Univ. Computational Electromagnetics Laboratory, http://nueml.ece.nwu.edu/)
Slide 7: Computational Electromagnetics (MOM)
After discretization the integral equation has the form Z·J = V, where Z is the impedance matrix, J is the unknown vector of amplitudes, and V is the excitation vector.
(See Cwik, Patterson, and Scott, "Electromagnetic Scattering on the Intel Touchstone Delta", IEEE Supercomputing '92, pp. 538-542.)
Slide 8: Computational Electromagnetics (MOM)
The main steps in the solution process are:
A) computing the matrix elements
B) factoring the dense matrix
C) solving for one or more excitations (right-hand sides)
D) computing the fields scattered from the object
Slide 9: Analysis of MOM for Parallel Implementation

  Task          Work    Parallelism        Parallel Speed
  Fill          O(n²)   embarrassing       low
  Factor        O(n³)   moderately diff.   very high
  Solve         O(n²)   moderately diff.   high
  Field Calc.   O(n)    embarrassing       high

For most scientific applications the biggest gain in performance is from parallelism within each task.
Slide 10: Results for the Parallel Implementation on the Delta

  Task          Time (hours)
  Fill           9.20
  Factor         8.25
  Solve          2.17
  Field Calc.    0.12

The problem solved was for a matrix of size 48,672 (the world record in 1991).
Slide 11: Current Records for Solving Dense Systems

  Year     System Size   Machine
  1950's      O(100)
  1991         55,296    CM-2
  1992         75,264    Intel
  1993         75,264    Intel
  1994         76,800    CM-5
  1995        128,600    Paragon XP
  1996        215,000    ASCI Red (Tflop)

source: Alan Edelman, http://www-math.mit.edu/~edelman/records.html
Slide 12: Sources for large dense linear systems
- Not many basic factorizations outside of CEM
- Large dense eigenproblems are used in chemistry
- Alternatives are often debated
- The choice of algorithms in existing codes is not the result of careful planning and design:
  - they reflect the state-of-the-art at the time,
  - or may be purely coincidental.
Slide 13: Solving Large Dense Linear Systems
- Gaussian elimination to solve Ax = b, where A is a dense matrix
- Add multiples of each row to subsequent rows in order to create zeros below the diagonal
- End up with an upper triangular matrix U
- Solve the linear system with U by substitution, starting with the last variable
- Solving these systems uses basic vector and matrix operations called BLAS
(see Demmel, http://HTTP.CS.Berkeley.EDU/~demmel/cs267/lecture12/lecture12.html)
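A minimal serial sketch of the procedure described above (plain C, no pivoting, written for clarity rather than performance; in practice the work would be organized around BLAS calls):

    #include <stddef.h>

    /* Reduce the dense n-by-n matrix A (row-major) to upper triangular form
     * by adding multiples of each row to the rows below it, updating b as well,
     * then solve U*x = b by back substitution starting with the last variable.
     * No pivoting is done, so this is only a sketch of the algorithm's structure. */
    void gauss_solve(size_t n, double *A, double *b, double *x)
    {
        for (size_t k = 0; k < n; k++) {                /* eliminate column k */
            for (size_t i = k + 1; i < n; i++) {
                double m = A[i*n + k] / A[k*n + k];     /* multiplier (assumes a nonzero pivot) */
                for (size_t j = k; j < n; j++)
                    A[i*n + j] -= m * A[k*n + j];       /* create a zero below the diagonal */
                b[i] -= m * b[k];
            }
        }
        for (size_t i = n; i-- > 0; ) {                 /* back substitution */
            double s = b[i];
            for (size_t j = i + 1; j < n; j++)
                s -= A[i*n + j] * x[j];
            x[i] = s / A[i*n + i];
        }
    }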
Slide 14: Parallel Matrix Multiply
Slide 15: Parallel Matrix Multiply
- Computing C = C + A·B
- Using the basic algorithm: 2n³ flops (see the serial sketch below)
- The variables are:
  - data layout
  - topology of the machine
  - scheduling of communication
- Use of performance models for algorithm design
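For reference, the basic algorithm is the familiar triple loop, which performs 2n³ flops for n-by-n matrices; a plain C sketch (row-major storage assumed):

    /* C = C + A*B for row-major n-by-n matrices: n*n*n multiply-adds = 2n^3 flops. */
    void matmul_basic(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = C[i*n + j];
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }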
Slide 16: 1D Layout
- Assume matrices are n×n and n is divisible by p
- A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))
- B(i,j) is the n/p by n/p sub-block of B(i) in rows j·n/p through (j+1)·n/p
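A small sketch of how this layout might be expressed in code (the helper names and the assumption that p divides n are mine, not from the slides):

    /* 1D block-column layout helpers (illustrative names, assuming p divides n). */
    int owner(int global_col, int n, int p)      { return global_col / (n / p); } /* which processor owns this column */
    int block_col_start(int i, int n, int p)     { return i * (n / p); }          /* first global column of A(i)      */
    int subblock_row_start(int j, int n, int p)  { return j * (n / p); }          /* first row of the sub-block B(i,j) */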
Slide 17: Matrix Multiply with a 1D Layout on a Bus
- Algorithm uses the formula
  C(i) = C(i) + A·B(i) = C(i) + Σ_{j=0..p-1} A(j)·B(j,i)
- First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time
- Second, consider a bus-connected machine with broadcast: one processor may send to many in a single step
Slide 18: MatMul on a 1D Bus without Broadcast
- Naïve algorithm (an MPI-style sketch follows below):

    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to p-1
        for j = 0 to p-1 except i
            if (myproc == i) send A(i) to processor j    // message passing
            if (myproc == j)
                receive A(i) from processor i
                C(myproc) = C(myproc) + A(i)*B(i,myproc)
        barrier

- Cost of the inner loop:
  - computation: 2n(n/p)² = 2n³/p² per step
  - communication: α + β·n²/p per message, roughly α·p² + β·p·n² over all p(p-1) messages
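A hedged MPI translation of the naive algorithm above (my own sketch, not code from the course): it assumes p divides n, row-major storage, and that each process holds its n-by-n/p block columns of A, B, and C in Alocal, Blocal, Clocal.

    #include <mpi.h>

    /* Local kernel: C(myproc) += Ablk * B(i,myproc), where Ablk is n x n/p
     * and B(i,myproc) is the n/p x n/p sub-block of Blocal starting at row i*nb. */
    void local_update(int n, int nb, int i,
                      const double *Ablk, const double *Blocal, double *Clocal)
    {
        for (int r = 0; r < n; r++)
            for (int c = 0; c < nb; c++) {
                double sum = Clocal[r*nb + c];
                for (int k = 0; k < nb; k++)
                    sum += Ablk[r*nb + k] * Blocal[(i*nb + k)*nb + c];
                Clocal[r*nb + c] = sum;
            }
    }

    /* Naive 1D matmul on a bus: one point-to-point send at a time, barrier per step. */
    void matmul_1d_naive(int n, double *Alocal, double *Blocal, double *Clocal,
                         double *Abuf /* scratch, n*(n/p) doubles */, MPI_Comm comm)
    {
        int p, myproc;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &myproc);
        int nb = n / p;                                   /* block-column width */

        local_update(n, nb, myproc, Alocal, Blocal, Clocal);
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++) {
                if (j == i) continue;
                if (myproc == i)                          /* owner sends A(i)   */
                    MPI_Send(Alocal, n*nb, MPI_DOUBLE, j, 0, comm);
                if (myproc == j) {                        /* receiver computes  */
                    MPI_Recv(Abuf, n*nb, MPI_DOUBLE, i, 0, comm, MPI_STATUS_IGNORE);
                    local_update(n, nb, i, Abuf, Blocal, Clocal);
                }
            }
            MPI_Barrier(comm);
        }
    }

With the barrier after every step and only one message in flight at a time, this matches the almost fully serial behavior analyzed on the next slide.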
Slide 19: Naïve MatMul (continued)
- Cost of the inner loop:
  - computation: 2n(n/p)² = 2n³/p² per step
  - communication: α + β·n²/p per message, roughly α·p² + β·p·n² in total
- Only one pair of processors (i and j) is active in any iteration, and of those, only the receiver is doing computation
  ⇒ the algorithm is almost entirely serial
- Running time: (p(p-1) + 1)·computation + p(p-1)·communication
  ≈ 2n³ + p²·α + p·n²·β
- This is worse than the serial time, and it grows with p
Slide 20: Better MatMul on a Bus
- Remove the barrier and send A(i) multiple times (an MPI-style sketch follows below):

    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to myproc-1
        receive A(i) from processor i
        C(myproc) = C(myproc) + A(i)*B(i,myproc)
    for i = 0 to p-1 except myproc
        send A(myproc) to processor i
    for i = myproc+1 to p-1
        receive A(i) from processor i
        C(myproc) = C(myproc) + A(i)*B(i,myproc)

- The program is indeterminate: sends/receives may happen in different orders
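The same sketch restructured to remove the barrier, again as my own hedged illustration; it reuses the data layout and the local_update helper from the previous sketch.

    #include <mpi.h>

    void local_update(int n, int nb, int i, const double *Ablk,
                      const double *Blocal, double *Clocal);   /* defined in the naive sketch */

    /* Better 1D matmul: receive blocks from lower ranks, send our own block
     * to everyone, then receive the remaining blocks from higher ranks. */
    void matmul_1d_better(int n, double *Alocal, double *Blocal, double *Clocal,
                          double *Abuf, MPI_Comm comm)
    {
        int p, myproc;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &myproc);
        int nb = n / p;

        local_update(n, nb, myproc, Alocal, Blocal, Clocal);
        for (int i = 0; i < myproc; i++) {                 /* blocks from lower ranks  */
            MPI_Recv(Abuf, n*nb, MPI_DOUBLE, i, 0, comm, MPI_STATUS_IGNORE);
            local_update(n, nb, i, Abuf, Blocal, Clocal);
        }
        for (int i = 0; i < p; i++)                        /* send A(myproc) to all    */
            if (i != myproc)
                MPI_Send(Alocal, n*nb, MPI_DOUBLE, i, 0, comm);
        for (int i = myproc + 1; i < p; i++) {             /* blocks from higher ranks */
            MPI_Recv(Abuf, n*nb, MPI_DOUBLE, i, 0, comm, MPI_STATUS_IGNORE);
            local_update(n, nb, i, Abuf, Blocal, Clocal);
        }
    }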
Slide 21: Performance of the Better Algorithm
- Intuitively, if a computation step is sufficiently large compared to communication, efficiency will be good
  - communication per block: α + n·(n/p)·β
  - computation per block: 2n³/p²
- Assume computation > communication
- In the timeline of execution:
  - i→j represents communication from processor i to processor j
  - iC represents a computation step by processor i
Slide 22: Performance of the Better Algorithm
- Timeline of execution:
  [Figure: bus timeline 0C, 0→1, 0→2, 0→3, ..., 0→p-1, 1C, 1C, 1→0, 2C, 2C, 2→0, ..., (p-1)C]
- If computation < (p-2)·communication, there are no bubbles in the pipeline, and
  Time = p(p-1)·communication + 2·computation < (p² + p - 4)·communication
- If communication = computation/(p-2), this is close to the lower bound of 2n³/p
- If communication is faster still, it has only a small impact on performance
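As a quick numeric check of the model on this slide (my own illustrative helpers; α, β, n, and p are whatever values you plug in), the predicted time and the lower bound it is compared against are:

    /* Time model from this slide: per-block communication = alpha + beta*n*n/p,
     * per-block computation = 2*n^3/p^2, and with a full pipeline
     * T = p*(p-1)*communication + 2*computation.  The lower bound is 2n^3/p. */
    double better_matmul_time(double n, double p, double alpha, double beta)
    {
        double comm = alpha + beta * n * n / p;
        double comp = 2.0 * n * n * n / (p * p);
        return p * (p - 1.0) * comm + 2.0 * comp;
    }

    double matmul_lower_bound(double n, double p) { return 2.0 * n * n * n / p; }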
Slide 23: MatMul with 1D Layout and Broadcast
- Modify the previous algorithm so that each send of A(i) is a broadcast (assumed to take one communication step)
- Time is now 2n³/p + p·α + n²·β
- Again, we require n >> p for good efficiency
- p times less communication time
- Broadcast helps performance (as expected); a sketch follows below
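A hedged MPI sketch of the broadcast variant (again my own code, with the same layout and local_update helper as before; MPI_Bcast plays the role of the one-to-many send):

    #include <mpi.h>
    #include <string.h>

    void local_update(int n, int nb, int i, const double *Ablk,
                      const double *Blocal, double *Clocal);   /* defined in the naive sketch */

    /* 1D matmul with broadcast: in step i, processor i broadcasts A(i)
     * and every processor updates C(myproc) += A(i)*B(i,myproc). */
    void matmul_1d_bcast(int n, double *Alocal, double *Blocal, double *Clocal,
                         double *Abuf, MPI_Comm comm)
    {
        int p, myproc;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &myproc);
        int nb = n / p;

        for (int i = 0; i < p; i++) {
            if (myproc == i)
                memcpy(Abuf, Alocal, (size_t)n * nb * sizeof(double));
            MPI_Bcast(Abuf, n*nb, MPI_DOUBLE, i, comm);   /* root i sends A(i) to all */
            local_update(n, nb, i, Abuf, Blocal, Clocal);
        }
    }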
Slide 24: MatMul with a 2D Layout
- Consider the processors as a 2D grid (physical or logical)
p(0,0) p(0,1) p(0,2)
p(1,0) p(1,1) p(1,2)
p(2,0) p(2,1) p(2,2)
Slide 25: Cannon's Algorithm
- C(i,j) = C(i,j) + Σ_{k=0..s-1} A(i,k)·B(k,j)
- Algorithm (s = sqrt(p); an MPI-style sketch follows below):

    for i = 0 to s-1                      // skew A
        left-circular-shift row i of A by i
        // so that A(i,j) is overwritten by A(i, (j+i) mod s)
    for j = 0 to s-1                      // skew B
        up-circular-shift column j of B by j
        // so that B(i,j) is overwritten by B((i+j) mod s, j)
    for k = 0 to s-1
        for i = 0 to s-1 and j = 0 to s-1 (all processors in parallel)
            C(i,j) = C(i,j) + A(i,j)*B(i,j)
        left-circular-shift each row of A by 1
        up-circular-shift each column of B by 1
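A hedged MPI sketch of Cannon's algorithm (my own code, not from the slides): it assumes p = s·s processes arranged as an s-by-s periodic Cartesian grid, each holding one (n/s)-by-(n/s) block of A, B, and C, and uses MPI_Sendrecv_replace for the circular shifts.

    #include <mpi.h>

    /* Local block update: C += A*B for nb-by-nb row-major blocks. */
    void block_multiply(int nb, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++) {
                double sum = C[i*nb + j];
                for (int k = 0; k < nb; k++)
                    sum += A[i*nb + k] * B[k*nb + j];
                C[i*nb + j] = sum;
            }
    }

    /* Cannon's algorithm on an s-by-s periodic Cartesian grid of p = s*s processes.
     * Each process owns one nb-by-nb block of A, B and C (nb = n/s). */
    void cannon(int nb, double *A, double *B, double *C, MPI_Comm grid)
    {
        int dims[2], periods[2], coords[2], src, dst;
        MPI_Cart_get(grid, 2, dims, periods, coords);
        int s = dims[0];

        /* Skew A: left-circular-shift row i of A by i (i = my row coordinate). */
        MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
        MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        /* Skew B: up-circular-shift column j of B by j (j = my column coordinate). */
        MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
        MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

        for (int k = 0; k < s; k++) {
            block_multiply(nb, A, B, C);
            /* Shift A left by one and B up by one for the next step. */
            MPI_Cart_shift(grid, 1, -1, &src, &dst);
            MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
            MPI_Cart_shift(grid, 0, -1, &src, &dst);
            MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        }
    }

The grid communicator is assumed to have been created with MPI_Cart_create using an s-by-s layout with both dimensions periodic, so the circular shifts wrap around; after the loop A and B are left circularly shifted and would need to be shifted back if the inputs must be restored.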
Slide 26: Communication in Cannon's Algorithm
Slide 27: BLAS Review
Slide 28: Fast linear algebra kernels: the BLAS
- Simple linear algebra kernels, such as matrix-matrix multiply
- More complicated algorithms can be built from these basic kernels
- The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines (BLAS)
- Early agreement on a standard interface (1980)
- Led to portable libraries for vector and shared-memory parallel machines
- On distributed memory, there is a less standard interface called the PBLAS
Slide 29: Level 1 BLAS
- Operate on vectors or pairs of vectors
  - perform O(n) operations
  - return either a vector or a scalar
- saxpy
  - y(i) = a·x(i) + y(i), for i = 1 to n
  - s stands for single precision; daxpy is for double precision, caxpy for complex, and zaxpy for double complex
- sscal: y = a·x, for scalar a and vectors x, y
- sdot: computes s = Σ_{i=1..n} x(i)·y(i)
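Reference C loops showing the semantics of these Level 1 operations (the real BLAS routines take stride arguments and different calling conventions, and are heavily optimized; these sketches only illustrate the math):

    /* saxpy: y(i) = a*x(i) + y(i) for i = 1..n  (2n flops). */
    void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* sscal as written on the slide: y = a*x.  (The reference BLAS sscal
     * actually scales x in place; this copy form matches the slide's statement.) */
    void sscal(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];
    }

    /* sdot: s = sum over i = 1..n of x(i)*y(i). */
    float sdot(int n, const float *x, const float *y)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += x[i] * y[i];
        return s;
    }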
Slide 30: Level 2 BLAS
- Operate on a matrix and a vector
  - return a matrix or a vector
  - O(n²) operations
- sgemv: matrix-vector multiply
  - y = y + A·x
  - where A is m-by-n, x is n-by-1, and y is m-by-1
- sger: rank-one update
  - A = A + y·x', i.e., A(i,j) = A(i,j) + y(i)·x(j)
  - where A is m-by-n, y is m-by-1, x is n-by-1, and x' is x transpose
- strsv: triangular solve
  - solves T·x = y for x, where T is triangular
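Illustrative C loops for the Level 2 operations above (strides and the real BLAS calling conventions are omitted; A is stored row-major, m-by-n, and the triangular solve is shown for a lower triangular T):

    /* sgemv-style update: y = y + A*x, where A is m-by-n (row-major).  O(mn) flops. */
    void sgemv_sketch(int m, int n, const float *A, const float *x, float *y)
    {
        for (int i = 0; i < m; i++) {
            float sum = y[i];
            for (int j = 0; j < n; j++)
                sum += A[i*n + j] * x[j];
            y[i] = sum;
        }
    }

    /* sger-style rank-one update: A(i,j) = A(i,j) + y(i)*x(j). */
    void sger_sketch(int m, int n, const float *y, const float *x, float *A)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                A[i*n + j] += y[i] * x[j];
    }

    /* strsv-style solve of T*x = y for x, with T lower triangular (forward substitution). */
    void strsv_sketch(int n, const float *T, const float *y, float *x)
    {
        for (int i = 0; i < n; i++) {
            float s = y[i];
            for (int j = 0; j < i; j++)
                s -= T[i*n + j] * x[j];
            x[i] = s / T[i*n + i];
        }
    }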
Slide 31: Level 3 BLAS
- Operate on pairs or triples of matrices
  - returning a matrix
  - complexity is O(n³)
- sgemm: matrix-matrix multiplication
  - C = C + A·B
  - where C is m-by-n, A is m-by-k, and B is k-by-n
- strsm: multiple triangular solve
  - solves Y = T·X for X
  - where T is a triangular matrix and X is a rectangular matrix
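An illustrative loop nest for the sgemm operation above (row-major, no transposes or strides; real sgemm implementations are blocked and tuned for the memory hierarchy, which is where the Level 3 performance advantage comes from):

    /* sgemm-style update: C = C + A*B, with C m-by-n, A m-by-k, B k-by-n (row-major). */
    void sgemm_sketch(int m, int n, int k,
                      const float *A, const float *B, float *C)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                float sum = C[i*n + j];
                for (int l = 0; l < k; l++)
                    sum += A[i*k + l] * B[l*n + j];
                C[i*n + j] = sum;
            }
    }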
Slide 32: Performance of BLAS
[Performance plot comparing Level 3, Level 2, and Level 1 BLAS routines]
Slide 33: Performance of BLAS
- The BLAS are specially optimized by the vendor
  - Sun's BLAS uses features of the UltraSPARC
- Big payoff for algorithms that can be expressed in terms of BLAS3 instead of BLAS2 or BLAS1: they can approach the top speed of the BLAS3
- Algorithms like Gaussian elimination are organized so that they use BLAS3 (a sketch of the idea follows below)
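As a hedged illustration of that reorganization (my own sketch, not from the slides): in a blocked, right-looking elimination, after a b-column panel is factored, the trailing submatrix is updated with one large matrix-matrix product, which is a Level 3 (gemm-like) operation and contains most of the flops. The routine names and row-major layout below are assumptions.

    /* Blocked elimination step, sketched: after factoring a b-column panel,
     * update the trailing nrem-by-nrem submatrix A22 with a single
     * matrix-matrix multiply  A22 = A22 - L21*U12,
     * which is a Level 3 BLAS (gemm-like) operation and dominates the flops. */
    void trailing_update(int nrem, int b,
                         const double *L21,   /* nrem-by-b    */
                         const double *U12,   /* b-by-nrem    */
                         double *A22)         /* nrem-by-nrem */
    {
        for (int i = 0; i < nrem; i++)
            for (int j = 0; j < nrem; j++) {
                double sum = A22[i*nrem + j];
                for (int k = 0; k < b; k++)
                    sum -= L21[i*b + k] * U12[k*nrem + j];
                A22[i*nrem + j] = sum;
            }
    }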