Title: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply
Slide 1: CS 267 Applications of Parallel Processors, Lecture 13: Parallel Matrix Multiply
- Kathy Yelick
- http://www.cs.berkeley.edu/~dmartin/cs267
Slide 2: Outline
- Recap
- Sources of large dense linear systems
- BLAS for linear algebra
- Parallel matrix multiply
Slide 3: Model overview
- Work-depth
- PRAM
- Latency/bandwidth model
  - α is the one-time cost per message (latency)
  - β is the per-byte cost of communication
  - we use this model today (a minimal sketch follows this list)
- LogP model
  - correction: the gap should be greater than the overhead
  - more on this with parallel sorting
- Topology-specific models
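The α/β model is the one behind all the cost estimates later in the lecture. As a minimal sketch (the function name and parameters are illustrative, not from the slides), the modeled time to send one message of nbytes bytes is:

    /* Latency/bandwidth model used in this lecture: one message of nbytes
     * bytes is modeled as costing alpha + beta*nbytes time units.
     * alpha and beta are whatever measured constants you plug in. */
    double message_time(double alpha /* per-message cost */,
                        double beta  /* per-byte cost    */,
                        double nbytes)
    {
        return alpha + beta * nbytes;
    }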
Slide 4: Dense Linear Algebra in Electromagnetics
Slide 5: Computational Electromagnetics
- Developed during the 1980s, driven by defense applications
- Determine the RCS (radar cross section) of an airplane
- Reduce the signature of a plane (stealth technology)
- Other applications are antenna design and medical equipment
- Two fundamental numerical approaches:
  - MOM, the method of moments (frequency domain), and
  - finite differences (time domain)
Slide 6: Computational Electromagnetics
- Discretize the surface into triangular facets using standard modeling tools
- The amplitudes of the currents on the surface are the unknowns
- The integral equation is discretized into a set of linear equations
(image: Northwestern Univ. Computational Electromagnetics Laboratory, http://nueml.ece.nwu.edu/)
Slide 7: Computational Electromagnetics (MOM)
After discretization the integral equation has the form Z·J = V, where Z is the impedance matrix, J is the unknown vector of amplitudes, and V is the excitation vector.
(See Cwik, Patterson, and Scott, "Electromagnetic Scattering on the Intel Touchstone Delta", IEEE Supercomputing '92, pp. 538-542.)
Slide 8: Computational Electromagnetics (MOM)
The main steps in the solution process are:
A) computing the matrix elements
B) factoring the dense matrix
C) solving for one or more excitations (right-hand sides)
D) computing the fields scattered from the object
Slide 9: Analysis of MOM for Parallel Implementation

  Task          Work    Parallelism        Parallel Speed
  Fill          O(n²)   embarrassing       low
  Factor        O(n³)   moderately diff.   very high
  Solve         O(n²)   moderately diff.   high
  Field Calc.   O(n)    embarrassing       high

For most scientific applications the biggest gain in performance is from parallelism within each task.
Slide 10: Results for the Parallel Implementation on the Delta

  Task          Time (hours)
  Fill           9.20
  Factor         8.25
  Solve          2.17
  Field Calc.    0.12

The problem solved was for a matrix of size 48,672 (the world record in 1991).
Slide 11: Current Records for Solving Dense Systems

  Year     System Size   Machine
  1950's      O(100)
  1991         55,296    CM-2
  1992         75,264    Intel
  1993         75,264    Intel
  1994         76,800    CM-5
  1995        128,600    Paragon XP
  1996        215,000    ASCI Red (Tflop)

source: Alan Edelman, http://www-math.mit.edu/~edelman/records.html
Slide 12: Sources for large dense linear systems
- Not many basic factorizations outside of CEM
- Large dense eigenproblems are used in chemistry
- Alternatives are often debated
- The choice of algorithms in existing codes is not the result of careful planning and design:
  - they reflect the state-of-the-art at the time,
  - or may be purely coincidental.
Slide 13: Solving Large Dense Linear Systems
- Gaussian elimination to solve Ax = b, where A is a dense matrix
- Add multiples of each row to subsequent rows in order to create zeros below the diagonal
- End up with an upper triangular matrix U
- Solve the linear system with U by substitution, starting with the last variable
- Solving these systems uses basic vector and matrix operations called BLAS
(see Demmel, http://HTTP.CS.Berkeley.EDU/~demmel/cs267/lecture12/lecture12.html)
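A minimal serial sketch of the procedure described above (plain C, no pivoting, written for clarity rather than performance; in practice the work would be organized around BLAS calls):

    #include <stddef.h>

    /* Reduce the dense n-by-n matrix A (row-major) to upper triangular form
     * by adding multiples of each row to the rows below it, updating b as well,
     * then solve U*x = b by back substitution starting with the last variable.
     * No pivoting is done, so this is only a sketch of the algorithm's structure. */
    void gauss_solve(size_t n, double *A, double *b, double *x)
    {
        for (size_t k = 0; k < n; k++) {                /* eliminate column k */
            for (size_t i = k + 1; i < n; i++) {
                double m = A[i*n + k] / A[k*n + k];     /* multiplier (assumes a nonzero pivot) */
                for (size_t j = k; j < n; j++)
                    A[i*n + j] -= m * A[k*n + j];       /* create a zero below the diagonal */
                b[i] -= m * b[k];
            }
        }
        for (size_t i = n; i-- > 0; ) {                 /* back substitution */
            double s = b[i];
            for (size_t j = i + 1; j < n; j++)
                s -= A[i*n + j] * x[j];
            x[i] = s / A[i*n + i];
        }
    }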
Slide 14: Parallel Matrix Multiply
Slide 15: Parallel Matrix Multiply
- Computing C = C + A·B
- Using the basic algorithm: 2n³ flops (see the serial sketch below)
- The variables are:
  - data layout
  - topology of the machine
  - scheduling of communication
- Use of performance models for algorithm design
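For reference, the basic algorithm is the familiar triple loop, which performs 2n³ flops for n-by-n matrices; a plain C sketch (row-major storage assumed):

    /* C = C + A*B for row-major n-by-n matrices: n*n*n multiply-adds = 2n^3 flops. */
    void matmul_basic(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = C[i*n + j];
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }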
Slide 16: 1D Layout
- Assume matrices are n×n and n is divisible by p
- A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))
- B(i,j) is the n/p by n/p sub-block of B(i) in rows j·n/p through (j+1)·n/p
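A small sketch of how this layout might be expressed in code (the helper names and the assumption that p divides n are mine, not from the slides):

    /* 1D block-column layout helpers (illustrative names, assuming p divides n). */
    int owner(int global_col, int n, int p)      { return global_col / (n / p); } /* which processor owns this column */
    int block_col_start(int i, int n, int p)     { return i * (n / p); }          /* first global column of A(i)      */
    int subblock_row_start(int j, int n, int p)  { return j * (n / p); }          /* first row of the sub-block B(i,j) */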
Slide 17: Matrix Multiply with a 1D Layout on a Bus
- Algorithm uses the formula
  C(i) = C(i) + A·B(i) = C(i) + Σ_{j=0..p-1} A(j)·B(j,i)
- First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time
- Second, consider a bus-connected machine with broadcast: one processor may send to many in a single step
Slide 18: MatMul on a 1D Bus without Broadcast
- Naïve algorithm (an MPI-style sketch follows below):

    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to p-1
        for j = 0 to p-1 except i
            if (myproc == i) send A(i) to processor j    // message passing
            if (myproc == j)
                receive A(i) from processor i
                C(myproc) = C(myproc) + A(i)*B(i,myproc)
        barrier

- Cost of the inner loop:
  - computation: 2n(n/p)² = 2n³/p² per step
  - communication: α + β·n²/p per message, roughly α·p² + β·p·n² over all p(p-1) messages
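A hedged MPI translation of the naive algorithm above (my own sketch, not code from the course): it assumes p divides n, row-major storage, and that each process holds its n-by-n/p block columns of A, B, and C in Alocal, Blocal, Clocal.

    #include <mpi.h>

    /* Local kernel: C(myproc) += Ablk * B(i,myproc), where Ablk is n x n/p
     * and B(i,myproc) is the n/p x n/p sub-block of Blocal starting at row i*nb. */
    void local_update(int n, int nb, int i,
                      const double *Ablk, const double *Blocal, double *Clocal)
    {
        for (int r = 0; r < n; r++)
            for (int c = 0; c < nb; c++) {
                double sum = Clocal[r*nb + c];
                for (int k = 0; k < nb; k++)
                    sum += Ablk[r*nb + k] * Blocal[(i*nb + k)*nb + c];
                Clocal[r*nb + c] = sum;
            }
    }

    /* Naive 1D matmul on a bus: one point-to-point send at a time, barrier per step. */
    void matmul_1d_naive(int n, double *Alocal, double *Blocal, double *Clocal,
                         double *Abuf /* scratch, n*(n/p) doubles */, MPI_Comm comm)
    {
        int p, myproc;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &myproc);
        int nb = n / p;                                   /* block-column width */

        local_update(n, nb, myproc, Alocal, Blocal, Clocal);
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++) {
                if (j == i) continue;
                if (myproc == i)                          /* owner sends A(i)   */
                    MPI_Send(Alocal, n*nb, MPI_DOUBLE, j, 0, comm);
                if (myproc == j) {                        /* receiver computes  */
                    MPI_Recv(Abuf, n*nb, MPI_DOUBLE, i, 0, comm, MPI_STATUS_IGNORE);
                    local_update(n, nb, i, Abuf, Blocal, Clocal);
                }
            }
            MPI_Barrier(comm);
        }
    }

With the barrier after every step and only one message in flight at a time, this matches the almost fully serial behavior analyzed on the next slide.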
Slide 19: Naïve MatMul (continued)
- Cost of the inner loop:
  - computation: 2n(n/p)² = 2n³/p² per step
  - communication: α + β·n²/p per message, roughly α·p² + β·p·n² in total
- Only one pair of processors (i and j) is active in any iteration, and of those, only the receiver is doing computation
  ⇒ the algorithm is almost entirely serial
- Running time: (p(p-1) + 1)·computation + p(p-1)·communication
  ≈ 2n³ + p²·α + p·n²·β
- This is worse than the serial time, and it grows with p
Slide 20: Better MatMul on a Bus
- Remove the barrier and send A(i) multiple times (an MPI-style sketch follows below):

    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to myproc-1
        receive A(i) from processor i
        C(myproc) = C(myproc) + A(i)*B(i,myproc)
    for i = 0 to p-1 except myproc
        send A(myproc) to processor i
    for i = myproc+1 to p-1
        receive A(i) from processor i
        C(myproc) = C(myproc) + A(i)*B(i,myproc)

- The program is indeterminate: sends/receives may happen in different orders
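The same sketch restructured to remove the barrier, again as my own hedged illustration; it reuses the data layout and the local_update helper from the previous sketch.

    #include <mpi.h>

    void local_update(int n, int nb, int i, const double *Ablk,
                      const double *Blocal, double *Clocal);   /* defined in the naive sketch */

    /* Better 1D matmul: receive blocks from lower ranks, send our own block
     * to everyone, then receive the remaining blocks from higher ranks. */
    void matmul_1d_better(int n, double *Alocal, double *Blocal, double *Clocal,
                          double *Abuf, MPI_Comm comm)
    {
        int p, myproc;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &myproc);
        int nb = n / p;

        local_update(n, nb, myproc, Alocal, Blocal, Clocal);
        for (int i = 0; i < myproc; i++) {                 /* blocks from lower ranks  */
            MPI_Recv(Abuf, n*nb, MPI_DOUBLE, i, 0, comm, MPI_STATUS_IGNORE);
            local_update(n, nb, i, Abuf, Blocal, Clocal);
        }
        for (int i = 0; i < p; i++)                        /* send A(myproc) to all    */
            if (i != myproc)
                MPI_Send(Alocal, n*nb, MPI_DOUBLE, i, 0, comm);
        for (int i = myproc + 1; i < p; i++) {             /* blocks from higher ranks */
            MPI_Recv(Abuf, n*nb, MPI_DOUBLE, i, 0, comm, MPI_STATUS_IGNORE);
            local_update(n, nb, i, Abuf, Blocal, Clocal);
        }
    }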
Slide 21: Performance of the Better Algorithm
- Intuitively, if a computation step is sufficiently large compared to communication, efficiency will be good
  - communication per block: α + n·(n/p)·β
  - computation per block: 2n³/p²
- Assume computation > communication
- In the timeline of execution:
  - i→j represents communication from processor i to processor j
  - iC represents a computation step by processor i
Slide 22: Performance of the Better Algorithm
- Timeline of execution:
  [Figure: bus timeline 0C, 0→1, 0→2, 0→3, ..., 0→p-1, 1C, 1C, 1→0, 2C, 2C, 2→0, ..., (p-1)C]
- If computation < (p-2)·communication, there are no bubbles in the pipeline, and
  Time = p(p-1)·communication + 2·computation < (p² + p - 4)·communication
- If communication = computation/(p-2), this is close to the lower bound of 2n³/p
- If communication is faster still, it has only a small impact on performance
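As a quick numeric check of the model on this slide (my own illustrative helpers; α, β, n, and p are whatever values you plug in), the predicted time and the lower bound it is compared against are:

    /* Time model from this slide: per-block communication = alpha + beta*n*n/p,
     * per-block computation = 2*n^3/p^2, and with a full pipeline
     * T = p*(p-1)*communication + 2*computation.  The lower bound is 2n^3/p. */
    double better_matmul_time(double n, double p, double alpha, double beta)
    {
        double comm = alpha + beta * n * n / p;
        double comp = 2.0 * n * n * n / (p * p);
        return p * (p - 1.0) * comm + 2.0 * comp;
    }

    double matmul_lower_bound(double n, double p) { return 2.0 * n * n * n / p; }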
Slide 23: MatMul with 1D Layout and Broadcast
- Modify the previous algorithm so that each send of A(i) is a broadcast (assumed to take one communication step)
- Time is now 2n³/p + p·α + n²·β
- Again, we require n >> p for good efficiency
- p times less communication time
- Broadcast helps performance (as expected); a sketch follows below
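A hedged MPI sketch of the broadcast variant (again my own code, with the same layout and local_update helper as before; MPI_Bcast plays the role of the one-to-many send):

    #include <mpi.h>
    #include <string.h>

    void local_update(int n, int nb, int i, const double *Ablk,
                      const double *Blocal, double *Clocal);   /* defined in the naive sketch */

    /* 1D matmul with broadcast: in step i, processor i broadcasts A(i)
     * and every processor updates C(myproc) += A(i)*B(i,myproc). */
    void matmul_1d_bcast(int n, double *Alocal, double *Blocal, double *Clocal,
                         double *Abuf, MPI_Comm comm)
    {
        int p, myproc;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &myproc);
        int nb = n / p;

        for (int i = 0; i < p; i++) {
            if (myproc == i)
                memcpy(Abuf, Alocal, (size_t)n * nb * sizeof(double));
            MPI_Bcast(Abuf, n*nb, MPI_DOUBLE, i, comm);   /* root i sends A(i) to all */
            local_update(n, nb, i, Abuf, Blocal, Clocal);
        }
    }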
Slide 24: MatMul with a 2D Layout
- Consider the processors as a 2D grid (physical or logical)
p(0,0) p(0,1) p(0,2)
p(1,0) p(1,1) p(1,2)
p(2,0) p(2,1) p(2,2)
Slide 25: Cannon's Algorithm
- C(i,j) = C(i,j) + Σ_{k=0..s-1} A(i,k)·B(k,j)
- Algorithm (s = sqrt(p); an MPI-style sketch follows below):

    for i = 0 to s-1                      // skew A
        left-circular-shift row i of A by i
        // so that A(i,j) is overwritten by A(i, (j+i) mod s)
    for j = 0 to s-1                      // skew B
        up-circular-shift column j of B by j
        // so that B(i,j) is overwritten by B((i+j) mod s, j)
    for k = 0 to s-1
        for i = 0 to s-1 and j = 0 to s-1 (all processors in parallel)
            C(i,j) = C(i,j) + A(i,j)*B(i,j)
        left-circular-shift each row of A by 1
        up-circular-shift each column of B by 1
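A hedged MPI sketch of Cannon's algorithm (my own code, not from the slides): it assumes p = s·s processes arranged as an s-by-s periodic Cartesian grid, each holding one (n/s)-by-(n/s) block of A, B, and C, and uses MPI_Sendrecv_replace for the circular shifts.

    #include <mpi.h>

    /* Local block update: C += A*B for nb-by-nb row-major blocks. */
    void block_multiply(int nb, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++) {
                double sum = C[i*nb + j];
                for (int k = 0; k < nb; k++)
                    sum += A[i*nb + k] * B[k*nb + j];
                C[i*nb + j] = sum;
            }
    }

    /* Cannon's algorithm on an s-by-s periodic Cartesian grid of p = s*s processes.
     * Each process owns one nb-by-nb block of A, B and C (nb = n/s). */
    void cannon(int nb, double *A, double *B, double *C, MPI_Comm grid)
    {
        int dims[2], periods[2], coords[2], src, dst;
        MPI_Cart_get(grid, 2, dims, periods, coords);
        int s = dims[0];

        /* Skew A: left-circular-shift row i of A by i (i = my row coordinate). */
        MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
        MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        /* Skew B: up-circular-shift column j of B by j (j = my column coordinate). */
        MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
        MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

        for (int k = 0; k < s; k++) {
            block_multiply(nb, A, B, C);
            /* Shift A left by one and B up by one for the next step. */
            MPI_Cart_shift(grid, 1, -1, &src, &dst);
            MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
            MPI_Cart_shift(grid, 0, -1, &src, &dst);
            MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        }
    }

The grid communicator is assumed to have been created with MPI_Cart_create using an s-by-s layout with both dimensions periodic, so the circular shifts wrap around; after the loop A and B are left circularly shifted and would need to be shifted back if the inputs must be restored.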
Slide 26: Communication in Cannon's Algorithm
Slide 27: BLAS Review
Slide 28: Fast linear algebra kernels: the BLAS
- Simple linear algebra kernels, such as matrix-matrix multiply
- More complicated algorithms can be built from these basic kernels
- The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines (BLAS)
- Early agreement on a standard interface (1980)
- Led to portable libraries for vector and shared-memory parallel machines
- On distributed memory, there is a less standard interface called the PBLAS
Slide 29: Level 1 BLAS
- Operate on vectors or pairs of vectors
  - perform O(n) operations
  - return either a vector or a scalar
- saxpy
  - y(i) = a·x(i) + y(i), for i = 1 to n
  - s stands for single precision; daxpy is for double precision, caxpy for complex, and zaxpy for double complex
- sscal: y = a·x, for scalar a and vectors x, y
- sdot: computes s = Σ_{i=1..n} x(i)·y(i)
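Reference C loops showing the semantics of these Level 1 operations (the real BLAS routines take stride arguments and different calling conventions, and are heavily optimized; these sketches only illustrate the math):

    /* saxpy: y(i) = a*x(i) + y(i) for i = 1..n  (2n flops). */
    void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* sscal as written on the slide: y = a*x.  (The reference BLAS sscal
     * actually scales x in place; this copy form matches the slide's statement.) */
    void sscal(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];
    }

    /* sdot: s = sum over i = 1..n of x(i)*y(i). */
    float sdot(int n, const float *x, const float *y)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += x[i] * y[i];
        return s;
    }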
Slide 30: Level 2 BLAS
- Operate on a matrix and a vector
  - return a matrix or a vector
  - O(n²) operations
- sgemv: matrix-vector multiply
  - y = y + A·x
  - where A is m-by-n, x is n-by-1, and y is m-by-1
- sger: rank-one update
  - A = A + y·x', i.e., A(i,j) = A(i,j) + y(i)·x(j)
  - where A is m-by-n, y is m-by-1, x is n-by-1, and x' is x transpose
- strsv: triangular solve
  - solves T·x = y for x, where T is triangular
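Illustrative C loops for the Level 2 operations above (strides and the real BLAS calling conventions are omitted; A is stored row-major, m-by-n, and the triangular solve is shown for a lower triangular T):

    /* sgemv-style update: y = y + A*x, where A is m-by-n (row-major).  O(mn) flops. */
    void sgemv_sketch(int m, int n, const float *A, const float *x, float *y)
    {
        for (int i = 0; i < m; i++) {
            float sum = y[i];
            for (int j = 0; j < n; j++)
                sum += A[i*n + j] * x[j];
            y[i] = sum;
        }
    }

    /* sger-style rank-one update: A(i,j) = A(i,j) + y(i)*x(j). */
    void sger_sketch(int m, int n, const float *y, const float *x, float *A)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                A[i*n + j] += y[i] * x[j];
    }

    /* strsv-style solve of T*x = y for x, with T lower triangular (forward substitution). */
    void strsv_sketch(int n, const float *T, const float *y, float *x)
    {
        for (int i = 0; i < n; i++) {
            float s = y[i];
            for (int j = 0; j < i; j++)
                s -= T[i*n + j] * x[j];
            x[i] = s / T[i*n + i];
        }
    }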
Slide 31: Level 3 BLAS
- Operate on pairs or triples of matrices
  - returning a matrix
  - complexity is O(n³)
- sgemm: matrix-matrix multiplication
  - C = C + A·B
  - where C is m-by-n, A is m-by-k, and B is k-by-n
- strsm: multiple triangular solve
  - solves Y = T·X for X
  - where T is a triangular matrix and X is a rectangular matrix
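An illustrative loop nest for the sgemm operation above (row-major, no transposes or strides; real sgemm implementations are blocked and tuned for the memory hierarchy, which is where the Level 3 performance advantage comes from):

    /* sgemm-style update: C = C + A*B, with C m-by-n, A m-by-k, B k-by-n (row-major). */
    void sgemm_sketch(int m, int n, int k,
                      const float *A, const float *B, float *C)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                float sum = C[i*n + j];
                for (int l = 0; l < k; l++)
                    sum += A[i*k + l] * B[l*n + j];
                C[i*n + j] = sum;
            }
    }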
Slide 32: Performance of BLAS
[Performance plot comparing Level 3, Level 2, and Level 1 BLAS routines]
Slide 33: Performance of BLAS
- The BLAS are specially optimized by the vendor
  - Sun's BLAS uses features of the UltraSPARC
- Big payoff for algorithms that can be expressed in terms of BLAS3 instead of BLAS2 or BLAS1: they can approach the top speed of the BLAS3
- Algorithms like Gaussian elimination are organized so that they use BLAS3 (a sketch of the idea follows below)
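As a hedged illustration of that reorganization (my own sketch, not from the slides): in a blocked, right-looking elimination, after a b-column panel is factored, the trailing submatrix is updated with one large matrix-matrix product, which is a Level 3 (gemm-like) operation and contains most of the flops. The routine names and row-major layout below are assumptions.

    /* Blocked elimination step, sketched: after factoring a b-column panel,
     * update the trailing nrem-by-nrem submatrix A22 with a single
     * matrix-matrix multiply  A22 = A22 - L21*U12,
     * which is a Level 3 BLAS (gemm-like) operation and dominates the flops. */
    void trailing_update(int nrem, int b,
                         const double *L21,   /* nrem-by-b    */
                         const double *U12,   /* b-by-nrem    */
                         double *A22)         /* nrem-by-nrem */
    {
        for (int i = 0; i < nrem; i++)
            for (int j = 0; j < nrem; j++) {
                double sum = A22[i*nrem + j];
                for (int k = 0; k < b; k++)
                    sum -= L21[i*b + k] * U12[k*nrem + j];
                A22[i*nrem + j] = sum;
            }
    }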