CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply



1
CS 267 Applications of Parallel Processors
Lecture 13: Parallel Matrix Multiply
  • Kathy Yelick
  • http://www.cs.berkeley.edu/~dmartin/cs267

2
Outline
  • Recap
  • Sources of large dense linear systems
  • BLAS for linear algebra
  • Parallel matrix multiply
3
Model overview
  • Work-depth
  • PRAM
  • Latency/Bandwidth model
  • α is the one-time cost per message (latency)
  • β is the per-byte cost of communication
  • Use this today (a small cost sketch follows below)
  • LogP model
  • correction: the gap should be greater than the overhead
  • more on this with parallel sorting
  • Topology-specific models
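As a rough illustration of the latency/bandwidth model, here is a minimal Python sketch of the predicted cost of one message; the α and β values are assumptions for illustration, not measurements.

  # Latency/bandwidth (alpha-beta) model: predicted time to send one n-byte message.
  ALPHA = 1e-5   # assumed per-message latency, in seconds
  BETA = 1e-9    # assumed per-byte cost, in seconds

  def message_time(n_bytes, alpha=ALPHA, beta=BETA):
      """Predicted time for a single message of n_bytes."""
      return alpha + beta * n_bytes

  for n in (8, 1_000, 1_000_000):
      print(f"{n:>9} bytes -> {message_time(n):.2e} s")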

4
Dense Linear Algebra in Electromagnetics
5
Computational Electromagnetics
  • developed during 1980s, driven by defense
    applications
  • determine the RCS (radar cross section) of an
    airplane
  • reduce the signature of a plane (stealth technology)
  • other applications: antenna design, medical
    equipment
  • two fundamental numerical approaches:
  • MOM, the method of moments (frequency domain), and
  • finite differences (time domain)

6
Computational Electromagnetics
  • discretize the surface into triangular facets using
    standard modeling tools
  • the amplitudes of the currents on the surface are the unknowns
  • the integral equation is discretized into a set of
    linear equations
image: NW Univ. Comp. Electromagnetics Laboratory
http://nueml.ece.nwu.edu/
7
Computational Electromagnetics (MOM)
After discretization, the integral equation has
the form Z J = V, where Z is the impedance
matrix, J is the unknown vector of amplitudes,
and V is the excitation vector.
(See Cwik, Patterson, and Scott, "Electromagnetic
Scattering on the Intel Touchstone Delta," IEEE
Supercomputing 92, pp. 538-542.)
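For illustration only, a minimal NumPy sketch of this step: build a small dense system Z J = V with random placeholder data (a real impedance matrix is filled from the discretized integral equation) and solve it directly.

  import numpy as np

  # Toy stand-in for the MOM system Z J = V (random complex data, not a real impedance matrix).
  n = 500
  rng = np.random.default_rng(0)
  Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))   # dense "impedance" matrix
  V = rng.standard_normal(n) + 1j * rng.standard_normal(n)             # excitation vector

  J = np.linalg.solve(Z, V)      # factor the dense matrix and solve for the current amplitudes
  print(np.allclose(Z @ J, V))   # residual check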
8
Computational Electromagnetics (MOM)
The main steps in the solution process are:
A) computing the matrix elements,
B) factoring the dense matrix,
C) solving for one or more excitations (RHS),
D) computing the fields scattered from the object.
9
Analysis of MOM for Parallel Implementation
Task          Work     Parallelism        Parallel Speed
Fill          O(n^2)   embarrassing       low
Factor        O(n^3)   moderately diff.   very high
Solve         O(n^2)   moderately diff.   high
Field Calc.   O(n)     embarrassing       high
For most scientific applications the biggest gain
in performance is from parallelism within each
task.
10
Results for Parallel Implementation on Delta
Task          Time (hours)
Fill          9.20
Factor        8.25
Solve         2.17
Field Calc.   0.12
The problem solved was for a matrix of size
48,672. (The world record in 1991.)
11
Current Records for Solving Dense Systems
Year    System Size   Machine
1950's  O(100)
1991    55,296        CM-2
1992    75,264        Intel
1993    75,264        Intel
1994    76,800        CM-5
1995    128,600       Paragon XP
1996    215,000       ASCI Red (Tflop)
source: Alan Edelman, http://www-math.mit.edu/~edelman/records.html
12
Sources for large dense linear systems
  • Not many basic factorizations outside CEM
  • Large dense eigenproblems used in chemistry
  • Alternatives often debated
  • The choice of algorithm in existing codes is often not
    the result of careful planning and design:
  • it reflects the state of the art at the time,
  • or may be purely coincidental.
13
Solving Large Dense Linear Systems
  • Gaussian elimination to solve Ax = b
  • where A is a dense matrix
  • Add multiples of each row to subsequent rows in
    order to create zeros below the diagonal
  • End up with an upper triangular matrix U.
  • Solve a linear system with U by substitution,
    starting with the last variable.
  • Solving these systems uses basic vector and
    matrix operations called BLAS.

see Demmel: http://HTTP.CS.Berkeley.EDU/~demmel/cs267/lecture12/lecture12.html
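The two steps above (elimination to upper triangular form, then back substitution) can be sketched in a few lines of NumPy. This is a teaching sketch without pivoting, not the blocked, BLAS3-based routine a real library would use.

  import numpy as np

  def gaussian_elimination_solve(A, b):
      """Solve Ax = b by elimination and back substitution (no pivoting; teaching sketch)."""
      A = A.astype(float).copy()
      b = b.astype(float).copy()
      n = len(b)
      # Add multiples of each row to subsequent rows to create zeros below the diagonal.
      for k in range(n - 1):
          for i in range(k + 1, n):
              m = A[i, k] / A[k, k]
              A[i, k:] -= m * A[k, k:]
              b[i] -= m * b[k]
      # Back substitution with the upper triangular U, starting with the last variable.
      x = np.zeros(n)
      for i in range(n - 1, -1, -1):
          x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
      return x

  A = np.array([[4.0, 1.0], [1.0, 3.0]])
  b = np.array([1.0, 2.0])
  print(gaussian_elimination_solve(A, b), np.linalg.solve(A, b))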
14
Parallel Matrix Multiply
15
Parallel Matrix Multiply
  • Computing C = C + A*B
  • Using the basic algorithm: 2n^3 flops
  • Variables are
  • Data layout
  • Topology of machine
  • Scheduling communication
  • Use of performance models for algorithm design

16
1D Layout
  • Assume matrices are n x n and n is divisible by p
  • A(i) refers to the n by n/p block column that
    processor i owns (similarly for B(i) and C(i))
  • B(i,j) is the n/p by n/p subblock of B(i)
  • in rows j*n/p through (j+1)*n/p
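A minimal NumPy sketch of this layout (the helper names block_col and sub_block are made up for illustration): processor i owns columns i*n/p through (i+1)*n/p - 1.

  import numpy as np

  n, p = 8, 4                     # assume p divides n
  nb = n // p
  B = np.arange(n * n, dtype=float).reshape(n, n)

  def block_col(M, i):
      """M(i): the n-by-n/p block column owned by processor i."""
      return M[:, i * nb:(i + 1) * nb]

  def sub_block(M, i, j):
      """M(i,j): the n/p-by-n/p block of M(i) in rows j*n/p through (j+1)*n/p - 1."""
      return block_col(M, i)[j * nb:(j + 1) * nb, :]

  print(block_col(B, 2).shape, sub_block(B, 2, 1).shape)   # (8, 2) (2, 2)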

17
Matrix Multiply 1D Layout on Bus
  • Algorithm uses the formula
    C(i) = C(i) + A*B(i) = C(i) + sum_{j=0}^{p-1} A(j)*B(j,i)
    (a small serial check of this formula follows below)
  • First consider a bus-connected machine without
    broadcast: only one pair of processors can
    communicate at a time
  • Second, consider a bus-connected machine with
    broadcast: it may send from one to many in a single
    step
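A quick serial NumPy check of this column-block formula (block sizes and data are arbitrary): summing A(j)*B(j,i) over j reproduces the corresponding block column of A*B.

  import numpy as np

  n, p = 6, 3
  nb = n // p
  rng = np.random.default_rng(1)
  A = rng.standard_normal((n, n))
  B = rng.standard_normal((n, n))

  i = 1                                                   # check processor i's block column C(i)
  C_i = np.zeros((n, nb))
  for j in range(p):
      A_j = A[:, j * nb:(j + 1) * nb]                     # A(j): block column j of A
      B_ji = B[j * nb:(j + 1) * nb, i * nb:(i + 1) * nb]  # B(j,i): j-th block of B(i)
      C_i += A_j @ B_ji
  print(np.allclose(C_i, (A @ B)[:, i * nb:(i + 1) * nb]))   # True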
18
MatMul on 1D Bus without Broadcast
  • Naïve algorithm (an mpi4py sketch follows after this slide):
    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to p-1
      for j = 0 to p-1 except i
        if (myproc == i) send A(i) to processor j   // message passing
        if (myproc == j)
          receive A(i) from processor i
          C(myproc) = C(myproc) + A(i)*B(i,myproc)
        barrier
  • Cost of inner loop:
  • computation: 2*n*(n/p)^2 = 2*n^3/p^2
  • communication: α + β*n^2/p   // approximately
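The mpi4py sketch below renders this naïve algorithm, under the assumptions that p divides n and that each rank already owns its block columns; the names n, my_A, my_B, my_C are made up for illustration.

  # Naïve 1D matmul on a bus-like network; run with e.g. `mpiexec -n 4 python naive_matmul.py`.
  from mpi4py import MPI
  import numpy as np

  comm = MPI.COMM_WORLD
  p, myproc = comm.Get_size(), comm.Get_rank()
  n = 8                                   # assumed problem size; p must divide n
  nb = n // p
  rng = np.random.default_rng(myproc)
  my_A = rng.standard_normal((n, nb))     # A(myproc): owned block column of A
  my_B = rng.standard_normal((n, nb))     # B(myproc): owned block column of B
  my_C = np.zeros((n, nb))                # C(myproc)

  def B_block(i):
      """B(i, myproc): the i-th n/p-by-n/p block of the owned column B(myproc)."""
      return my_B[i * nb:(i + 1) * nb, :]

  my_C += my_A @ B_block(myproc)          # local contribution
  for i in range(p):
      for j in range(p):
          if j == i:
              continue
          if myproc == i:
              comm.send(my_A, dest=j)     # message passing: ship A(i) to processor j
          if myproc == j:
              A_i = comm.recv(source=i)
              my_C += A_i @ B_block(i)
          comm.Barrier()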

19
Naïve MatMul (continued)
  • Cost of inner loop:
  • computation: 2*n*(n/p)^2 = 2*n^3/p^2
  • communication: α + β*n^2/p   // approximately
  • Only 1 pair of processors (i and j) is active on
    any iteration,
  • and of those, only j is doing computation
  • => the algorithm is almost entirely serial
  • Running time: (p*(p-1) + 1)*computation + p*(p-1)*communication
  • ≈ 2*n^3 + p^2*α + p*n^2*β
  • this is worse than the serial time, and grows
    with p
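To make the "worse than serial" claim concrete, here is a tiny evaluation of these formulas; the flop time, α, and β below are assumed illustrative values, not measurements.

  # Naive-bus running time 2*n^3 + p^2*alpha + p*n^2*beta versus the serial time 2*n^3
  # (flop counts converted to seconds with an assumed flop rate).
  FLOP_TIME = 1e-8            # assumed seconds per flop
  ALPHA, BETA = 1e-4, 1e-7    # assumed per-message and per-word costs, in seconds

  def serial_time(n):
      return 2 * n**3 * FLOP_TIME

  def naive_bus_time(n, p):
      return 2 * n**3 * FLOP_TIME + p**2 * ALPHA + p * n**2 * BETA

  n = 1000
  for p in (4, 16, 64):
      # The parallel estimate exceeds the serial one and grows with p.
      print(p, serial_time(n), naive_bus_time(n, p))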

20
Better MatMul on a Bus
  • Remove the barrier and send A(i) multiple times
    (an mpi4py sketch follows after this slide):
    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to myproc-1
      receive A(i) from processor i
      C(myproc) = C(myproc) + A(i)*B(i,myproc)
    for i = 0 to p-1 except myproc
      send A(myproc) to processor i
    for i = myproc+1 to p-1
      receive A(i) from processor i
      C(myproc) = C(myproc) + A(i)*B(i,myproc)
  • Program is indeterminate: sends/receives may
    happen in different orders
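A minimal mpi4py rendering of this barrier-free version, under the same assumptions as the naïve sketch (p divides n, blocks already distributed, illustrative variable names):

  # "Better" 1D matmul: no barrier; each rank ships its own A block to every other rank.
  # Run with e.g. `mpiexec -n 4 python better_matmul.py`.
  from mpi4py import MPI
  import numpy as np

  comm = MPI.COMM_WORLD
  p, myproc = comm.Get_size(), comm.Get_rank()
  n = 8
  nb = n // p
  rng = np.random.default_rng(myproc)
  my_A = rng.standard_normal((n, nb))                       # A(myproc)
  my_B = rng.standard_normal((n, nb))                       # B(myproc)
  my_C = my_A @ my_B[myproc * nb:(myproc + 1) * nb, :]      # A(myproc)*B(myproc,myproc)

  for i in range(myproc):                  # receive from lower-ranked processors first
      A_i = comm.recv(source=i)
      my_C += A_i @ my_B[i * nb:(i + 1) * nb, :]
  for i in range(p):                       # then send my block to everyone else
      if i != myproc:
          comm.send(my_A, dest=i)
  for i in range(myproc + 1, p):           # finally receive from higher-ranked processors
      A_i = comm.recv(source=i)
      my_C += A_i @ my_B[i * nb:(i + 1) * nb, :]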

21
Performance of Better Algorithm
  • Intuitively, if a computation step is sufficiently
    large compared to communication, efficiency will
    be good
  • communication = α + n*(n/p)*β
  • computation = 2*n^3/p^2
  • Assume computation > communication
  • i-j represents communication
    from i to j
  • iC represents a computation step
    by processor i

22
Performance of Better Algorithm
  • Timeline of execution (see diagram below)
  • If computation < (p-2)*communication:
  • No bubbles in the pipeline
  • Time = p*(p-1)*communication + 2*computation
  • Time < (p^2 + p - 4)*communication
  • If communication ≈ computation/(p-2), this is close to the
    lower bound of 2*n^3/p
  • If communication is faster, it has only a small impact on
    performance

[Timeline diagram: 0C, 0-1, 0-2, 0-3, ..., 0-(p-1), 1C, 1-0, 2C, 2-0, ..., (p-1)C]
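A small numeric check of these bounds under the latency/bandwidth model; the flop time, α, and β are assumed values chosen only to exercise the formulas.

  # Pipeline condition and time bound for the "better" bus algorithm.
  FLOP_TIME = 1e-9                  # assumed seconds per flop
  ALPHA, BETA = 1e-4, 1e-7          # assumed latency and per-word cost, in seconds

  n, p = 1000, 8
  communication = ALPHA + n * (n / p) * BETA     # time to ship one A(i) block
  computation = 2 * n**3 / p**2 * FLOP_TIME      # one block update

  print("no bubbles:", computation < (p - 2) * communication)
  time = p * (p - 1) * communication + 2 * computation
  print("time:", time, "  bound:", (p**2 + p - 4) * communication)
  print("lower bound 2n^3/p:", 2 * n**3 / p * FLOP_TIME)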
23
MatMul 1D Layout and Broadcast
  • Modify the previous algorithm so that each of the
    sends of A(i) is a broadcast (assumed to take 1 comm.
    step); an mpi4py sketch follows after this slide
  • Time is now 2*n^3/p + p*α + n^2*β
  • Again, we require n >> p for good efficiency
  • p times less communication time
  • broadcast helps performance (as expected)
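A minimal mpi4py sketch of the broadcast variant, under the same assumptions as before (p divides n, block columns already distributed, illustrative names):

  # Broadcast 1D matmul: each processor broadcasts its A block in turn.
  # Run with e.g. `mpiexec -n 4 python bcast_matmul.py`.
  from mpi4py import MPI
  import numpy as np

  comm = MPI.COMM_WORLD
  p, myproc = comm.Get_size(), comm.Get_rank()
  n = 8
  nb = n // p
  rng = np.random.default_rng(myproc)
  my_A = rng.standard_normal((n, nb))      # A(myproc)
  my_B = rng.standard_normal((n, nb))      # B(myproc)
  my_C = np.zeros((n, nb))                 # C(myproc)

  for i in range(p):
      A_i = comm.bcast(my_A if myproc == i else None, root=i)   # one-to-many in a single step
      my_C += A_i @ my_B[i * nb:(i + 1) * nb, :]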

24
MatMul with 2D Layout
  • Consider processors in 2D grid (physical or
    logical)

p(0,0) p(0,1) p(0,2)
p(1,0) p(1,1) p(1,2)
p(2,0) p(2,1) p(2,2)
25
Cannon's Algorithm
  • C(i,j) = C(i,j) + sum_{k=0}^{s-1} A(i,k)*B(k,j)
  • Algorithm (s = sqrt(p)); a serial NumPy simulation follows below:
    for i = 0 to s-1                    // skew A
      left-circular-shift row i of A by i
      so that A(i,j) is overwritten by A(i, (j+i) mod s)
    for i = 0 to s-1                    // skew B
      up-circular-shift column i of B by i
      so that B(i,j) is overwritten by B((i+j) mod s, j)
    for k = 0 to s-1
      for i = 0 to s-1 and j = 0 to s-1
        C(i,j) = C(i,j) + A(i,j)*B(i,j)
      left-circular-shift each row of A by 1
      up-circular-shift each column of B by 1
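A serial NumPy simulation of Cannon's algorithm on an s x s grid of blocks (block size and data are arbitrary); it mimics the skew and shift pattern with modular index arithmetic rather than real messages.

  import numpy as np

  s, nb = 3, 2                       # s = sqrt(p); nb-by-nb blocks
  n = s * nb
  rng = np.random.default_rng(2)
  A = rng.standard_normal((n, n))
  B = rng.standard_normal((n, n))
  C = np.zeros((n, n))

  # Ablk[i][j] / Bblk[i][j] hold the block currently resident at grid position (i,j).
  Ablk = [[A[i*nb:(i+1)*nb, j*nb:(j+1)*nb].copy() for j in range(s)] for i in range(s)]
  Bblk = [[B[i*nb:(i+1)*nb, j*nb:(j+1)*nb].copy() for j in range(s)] for i in range(s)]

  # Skew: left-shift row i of A by i, up-shift column j of B by j.
  Ablk = [[Ablk[i][(j + i) % s] for j in range(s)] for i in range(s)]
  Bblk = [[Bblk[(i + j) % s][j] for j in range(s)] for i in range(s)]

  for k in range(s):
      for i in range(s):
          for j in range(s):
              C[i*nb:(i+1)*nb, j*nb:(j+1)*nb] += Ablk[i][j] @ Bblk[i][j]
      # Left-circular-shift each row of A by 1; up-circular-shift each column of B by 1.
      Ablk = [[Ablk[i][(j + 1) % s] for j in range(s)] for i in range(s)]
      Bblk = [[Bblk[(i + 1) % s][j] for j in range(s)] for i in range(s)]

  print(np.allclose(C, A @ B))       # True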

26
Communication in Cannon
27
BLAS Review
28
Fast linear algebra kernels BLAS
  • Simple linear algebra kernels such as
    matrix-matrix multiply
  • More complicated algorithms can be built from
    these basic kernels.
  • The interfaces of these kernels have been
    standardized as the Basic Linear Algebra
    Subroutines (BLAS).
  • Early agreement on standard interface (1980)
  • Led to portable libraries for vector and shared
    memory parallel machines.
  • On distributed memory, there is a less-standard
    interface called the PBLAS

29
Level 1 BLAS
  • Operate on vectors or pairs of vectors
  • perform O(n) operations
  • return either a vector or a scalar.
  • saxpy
  • y(i) = a*x(i) + y(i), for i = 1 to n
  • s stands for single precision; daxpy is for
    double precision, caxpy for complex, and zaxpy
    for double complex
  • sscal: y = a*x, for scalar a and vectors x, y
  • sdot computes s = sum_{i=1}^{n} x(i)*y(i)
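Illustrative NumPy equivalents of these Level 1 operations (NumPy itself calls an optimized BLAS internally; the data here is arbitrary):

  import numpy as np

  n, a = 5, 2.0
  x = np.arange(1.0, n + 1)
  y = np.ones(n)

  y = a * x + y                 # axpy: y(i) = a*x(i) + y(i)
  x = a * x                     # scal: scale a vector by the scalar a
  s = np.dot(x, y)              # dot:  s = sum_i x(i)*y(i)
  print(y, s)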

30
Level 2 BLAS
  • Operate on a matrix and a vector
  • return a matrix or a vector
  • O(n2) operations
  • sgemv: matrix-vector multiply
  • y = y + A*x
  • where A is m-by-n, x is n-by-1 and y is m-by-1
  • sger: rank-one update
  • A = A + y*x', i.e. A(i,j) = A(i,j) + y(i)*x(j)
  • where A is m-by-n, y is m-by-1, x is n-by-1, x'
    is x transpose
  • strsv: triangular solve
  • solves y = T*x for x, where T is triangular
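Illustrative NumPy/SciPy equivalents of these Level 2 operations (small random data; scipy.linalg.solve_triangular stands in for the triangular solve):

  import numpy as np
  from scipy.linalg import solve_triangular

  rng = np.random.default_rng(3)
  m, n = 4, 3
  A = rng.standard_normal((m, n))
  x = rng.standard_normal(n)
  y = rng.standard_normal(m)

  y = y + A @ x                          # gemv: matrix-vector multiply
  A = A + np.outer(y, x)                 # ger:  rank-one update, A(i,j) += y(i)*x(j)

  T = np.triu(rng.standard_normal((n, n))) + 3 * np.eye(n)   # well-conditioned upper triangular T
  b = rng.standard_normal(n)
  x = solve_triangular(T, b)             # trsv: solve T*x = b for x
  print(np.allclose(T @ x, b))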

31
Level 3 BLAS
  • Operate on pairs or triples of matrices
  • returning a matrix
  • complexity is O(n3).
  • sgemm: matrix-matrix multiplication
  • C = C + A*B,
  • where C is m-by-n, A is m-by-k, and B is k-by-n
  • strsm: multiple triangular solve
  • solves Y = T*X for X,
  • where T is a triangular matrix, and X is a
    rectangular matrix.
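Illustrative NumPy/SciPy equivalents of these Level 3 operations (random data; solve_triangular with a matrix right-hand side stands in for the multiple triangular solve):

  import numpy as np
  from scipy.linalg import solve_triangular

  rng = np.random.default_rng(4)
  m, n, k = 4, 3, 5
  A = rng.standard_normal((m, k))
  B = rng.standard_normal((k, n))
  C = rng.standard_normal((m, n))

  C = C + A @ B                            # gemm: C = C + A*B

  T = np.tril(rng.standard_normal((m, m))) + 3 * np.eye(m)   # lower triangular T
  Y = rng.standard_normal((m, n))
  X = solve_triangular(T, Y, lower=True)   # trsm: solve T*X = Y for the matrix X
  print(np.allclose(T @ X, Y))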

32
Performance of BLAS
[Performance plot comparing Level 3, Level 2, and Level 1 BLAS]
33
Performance of BLAS
  • BLAS are specially optimized by the vendor
  • Sun's BLAS uses features of the UltraSPARC
  • Big payoff for algorithms that can be expressed in
    terms of BLAS3 instead of BLAS2 or BLAS1
  • The top speed is that of the BLAS3
  • Algorithms like Gaussian elimination are organized so
    that they use BLAS3