Title: Topic Overview
1 Topic Overview
- Matrix-Matrix Multiplication
- Block Matrix Operations
- A Simple Parallel Matrix-Matrix Multiplication
- Cannon's Algorithm
- Overlapping Communication with Computation
2 Matrix-Matrix Multiplication
- Building on our matrix-vector multiplication (Quinn's Chapter 8), we now consider matrix-matrix multiplication: multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B.
- For simplicity, we use the following serial algorithm (a minimal sketch is given below).
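- A minimal C sketch of this serial algorithm (the standard triple loop; the function name and row-major storage convention are illustrative):

    /* Serial matrix-matrix multiplication: C = A x B, where A, B, C are
       n x n matrices stored in row-major order. Theta(n^3) operations. */
    void matmul_serial(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }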
3 Block Matrix Operations
- Matrix computations involving scalar algebraic operations on the matrix elements can be expressed in terms of identical operations on submatrices of the original matrix.
- Such algebraic operations on the submatrices are called block matrix operations.
- They are useful in matrix multiplication as well as in a variety of other matrix algorithms.
- In this view, an n x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) x (n/q) submatrix.
- We perform q^3 matrix multiplications, each involving (n/q) x (n/q) matrices and requiring (n/q)^3 additions and multiplications (a serial sketch follows).
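- A serial C sketch of the block formulation (illustrative; assumes n is a multiple of q, row-major storage, and that the caller has zero-initialized C):

    /* Block matrix multiplication: treat the n x n matrices as a q x q
       grid of b x b blocks (b = n/q) and perform q^3 block
       multiply-accumulate operations, each costing (n/q)^3 multiplications
       and additions. */
    void matmul_blocked(int n, int q, const double *A, const double *B,
                        double *C)
    {
        int b = n / q;  /* block dimension */
        for (int i = 0; i < q; i++)
            for (int j = 0; j < q; j++)
                for (int k = 0; k < q; k++)
                    /* Block operation: C(i,j) += A(i,k) * B(k,j) */
                    for (int ii = i * b; ii < (i + 1) * b; ii++)
                        for (int kk = k * b; kk < (k + 1) * b; kk++) {
                            double aik = A[ii * n + kk];
                            for (int jj = j * b; jj < (j + 1) * b; jj++)
                                C[ii * n + jj] += aik * B[kk * n + jj];
                        }
    }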
4 Block Matrix Operations
- (Figure: block matrix multiplication of an n x n matrix viewed as a q x q array of (n/q) x (n/q) blocks.)
5 A Simple Parallel Matrix-Matrix Multiplication Algorithm
- Consider two n x n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each.
- Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.
- Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.
- All-to-all broadcast blocks of A along rows and B along columns.
- Perform local submatrix multiplication (an MPI sketch is given below).
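- A hedged MPI sketch of this algorithm (assumptions: p is a perfect square, n is divisible by √p, ranks map to the √p x √p grid in row-major order, and each process passes its contiguous local blocks a and b plus an output block c; all names are illustrative):

    #include <mpi.h>
    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    void simple_parallel_matmul(int n, const double *a, const double *b,
                                double *c, MPI_Comm comm)
    {
        int p, rank;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &rank);
        int sq = (int)(sqrt((double)p) + 0.5);  /* process grid is sq x sq */
        int nb = n / sq;                        /* block dimension         */
        int myrow = rank / sq, mycol = rank % sq;

        /* Row and column communicators for the all-to-all broadcasts. */
        MPI_Comm rowcomm, colcomm;
        MPI_Comm_split(comm, myrow, mycol, &rowcomm);
        MPI_Comm_split(comm, mycol, myrow, &colcomm);

        /* All-to-all broadcast blocks of A along rows and B along columns:
           afterwards arow holds A[myrow][0..sq-1] and bcol holds
           B[0..sq-1][mycol]. */
        double *arow = malloc((size_t)sq * nb * nb * sizeof(double));
        double *bcol = malloc((size_t)sq * nb * nb * sizeof(double));
        MPI_Allgather(a, nb * nb, MPI_DOUBLE, arow, nb * nb, MPI_DOUBLE,
                      rowcomm);
        MPI_Allgather(b, nb * nb, MPI_DOUBLE, bcol, nb * nb, MPI_DOUBLE,
                      colcomm);

        /* Local multiplications: C[myrow][mycol] = sum_k A[myrow][k] B[k][mycol]. */
        memset(c, 0, (size_t)nb * nb * sizeof(double));
        for (int k = 0; k < sq; k++) {
            const double *ak = arow + (size_t)k * nb * nb;
            const double *bk = bcol + (size_t)k * nb * nb;
            for (int i = 0; i < nb; i++)
                for (int j = 0; j < nb; j++)
                    for (int l = 0; l < nb; l++)
                        c[i * nb + j] += ak[i * nb + l] * bk[l * nb + j];
        }
        free(arow); free(bcol);
        MPI_Comm_free(&rowcomm); MPI_Comm_free(&colcomm);
    }

- Note how each process ends up holding √p blocks of each input matrix; this is exactly the memory drawback analyzed two slides below.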
6 Matrix-Matrix Multiplication: Performance Analysis
- The two all-to-all broadcasts take time 2(t_s log(√p) + t_w (n^2/p)(√p - 1)).
- The computation requires √p multiplications of (n/√p) x (n/√p) sized submatrices, i.e., √p (n/√p)^3 = n^3/p time.
- The parallel run time is approximately T_P = n^3/p + t_s log p + 2 t_w n^2/√p.
7 Drawback of the Simple Parallel Algorithm
- A major drawback of this algorithm is that it is not memory optimal.
- Each process holds √p blocks of both matrices A and B at the end of each communication phase.
- Since each block requires Θ(n^2/p) memory, each process requires Θ(n^2/√p) memory.
- The total memory requirement over all the processes is Θ(√p n^2), i.e., Θ(√p) times the memory requirement of the sequential algorithm (e.g., for p = 16 processes, 4 times as much memory).
8 Matrix-Matrix Multiplication: Cannon's Algorithm
- Cannon's algorithm is a memory-efficient version of the simple parallel algorithm, with a total memory requirement of Θ(n^2).
- Matrices A and B are partitioned into p square blocks as in the simple parallel algorithm.
- Although every process in the ith row requires all √p submatrices Ai,k, the all-to-all broadcast can be avoided by
- scheduling the computations of the √p processes of the ith row such that, at any given time, each process is using a different block Ai,k, and
- systematically rotating these blocks among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation.
- If an identical schedule is applied to the columns of B, then no process holds more than one block of each matrix at any time.
9 Communication Steps in Cannon's Algorithm
- (Figure 8.3: the alignment and shift steps of Cannon's algorithm.)
10 Communication Steps in Cannon's Algorithm
- First, align the blocks of A and B in such a way that each process can multiply its local submatrices:
- shift submatrices Ai,j to the left (with wraparound) by i steps
- shift submatrices Bi,j up (with wraparound) by j steps.
- After alignment (Figure 8.3c), process Pi,j has submatrices Ai,(i+j) mod √p and B(i+j) mod √p,j.
- Perform local block matrix multiplication.
- Next, each block of A moves one step left and each block of B moves one step up (again with wraparound).
- Perform the next block multiplication, add it to the partial result, and repeat until all √p blocks have been multiplied.
11 Cannon's Algorithm: An Example
- Consider two matrices A and B to be multiplied.
- Assume that these matrices are partitioned into 4 square blocks: A into A0,0, A0,1, A1,0, A1,1 and B into B0,0, B0,1, B1,0, B1,1, with blocks Ai,j and Bi,j assigned to process Pi,j.
- After the initial alignment, matrices A and B become
- A: row 0 unchanged (A0,0 A0,1), row 1 rotated left by one (A1,1 A1,0)
- B: column 0 unchanged (B0,0 B1,0), column 1 rotated up by one (B1,1 B0,1).
12 Cannon's Algorithm: An Example
- After this alignment, process
- P0,0 ends up with A0,0 and B0,0 and should compute C0,0
- P0,1 ends up with A0,1 and B1,1 and should compute C0,1
- P1,0 ends up with A1,1 and B1,0 and should compute C1,0
- P1,1 ends up with A1,0 and B0,1 and should compute C1,1
- The local block matrix multiplications produce the partial results C0,0 = A0,0 B0,0, C0,1 = A0,1 B1,1, C1,0 = A1,1 B1,0, and C1,1 = A1,0 B0,1.
13 Cannon's Algorithm: An Example
- Shift 1: shift each block of A one step to the left and each block of B one step up (with wraparound), so that P0,0 now holds A0,1 and B1,0, P0,1 holds A0,0 and B0,1, P1,0 holds A1,0 and B0,0, and P1,1 holds A1,1 and B1,1.
- Next, each process Pi,j performs its block multiplication, updating Ci,j:
- C0,0 += A0,1 B1,0, C0,1 += A0,0 B0,1, C1,0 += A1,0 B0,0, C1,1 += A1,1 B1,1.
- Each Ci,j now equals Ai,0 B0,j + Ai,1 B1,j, which is the desired block of C.
14 Cannon's Algorithm: Performance Analysis
- In the alignment step, the maximum distance over which a block shifts is √p - 1, so the two shift operations require a total of 2(t_s + t_w n^2/p) time.
- Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes t_s + t_w n^2/p time.
- The computation time for multiplying √p matrices of size (n/√p) x (n/√p) is n^3/p.
- The parallel time is approximately T_P = n^3/p + 2√p t_s + 2 t_w n^2/√p.
15 MPI_Cart_shift Function
- Shifting data along the dimensions of the 2-D mesh is a frequent operation in Cannon's algorithm.
- MPI provides the function MPI_Cart_shift for this purpose:

    int MPI_Cart_shift(
        MPI_Comm comm_cart,  /* communicator with Cartesian structure (handle) */
        int dir,             /* coordinate dimension along which to shift */
        int s_step,          /* shift size/displacement (> 0: upward shift,
                                < 0: downward shift) */
        int *rank_source,    /* rank of source process (output) */
        int *rank_dest)      /* rank of destination process (output) */

- An example program exercising this function is sketched below.
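- A small example program (illustrative) that creates a periodic 2-D Cartesian topology and queries the source and destination ranks of a one-step shift:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int p, rank, src, dst;
        int dims[2] = {0, 0}, periods[2] = {1, 1};  /* wraparound mesh */
        MPI_Comm cart;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Dims_create(p, 2, dims);            /* choose a 2-D shape */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
        MPI_Comm_rank(cart, &rank);

        /* One-step shift along dimension 1 (the second grid dimension). */
        MPI_Cart_shift(cart, 1, 1, &src, &dst);
        printf("rank %d: receives from %d, sends to %d\n", rank, src, dst);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }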
16 Sending and Receiving Messages Simultaneously
- To exchange messages, MPI provides the following function:

    int MPI_Sendrecv(void *sendbuf, int sendcount,
                     MPI_Datatype senddatatype, int dest,
                     int sendtag, void *recvbuf, int recvcount,
                     MPI_Datatype recvdatatype, int source,
                     int recvtag, MPI_Comm comm, MPI_Status *status)

- The arguments combine the arguments of the send and receive functions.
- If we wish to use the same buffer for both send and receive, we can use:

    int MPI_Sendrecv_replace(void *buf, int count,
                             MPI_Datatype datatype, int dest,
                             int sendtag, int source, int recvtag,
                             MPI_Comm comm, MPI_Status *status)

- A parallel program for Cannon's algorithm built on these functions is sketched below.
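- A hedged C sketch of Cannon's algorithm using MPI_Cart_shift and MPI_Sendrecv_replace (assumptions: cart is a √p x √p periodic Cartesian communicator with dimension 0 indexing rows and dimension 1 indexing columns, sq = √p, nb = n/√p, and a, b, c are the local blocks with c zero-initialized; all names are illustrative):

    #include <mpi.h>

    /* Local multiply-accumulate on nb x nb blocks: c += a * b. */
    static void block_mult(int nb, const double *a, const double *b,
                           double *c)
    {
        for (int i = 0; i < nb; i++)
            for (int k = 0; k < nb; k++)
                for (int j = 0; j < nb; j++)
                    c[i * nb + j] += a[i * nb + k] * b[k * nb + j];
    }

    void cannon(int nb, int sq, double *a, double *b, double *c,
                MPI_Comm cart)
    {
        int rank, coords[2], src, dst;
        MPI_Status status;
        MPI_Comm_rank(cart, &rank);
        MPI_Cart_coords(cart, rank, 2, coords);

        /* Initial alignment: shift A left by i steps, B up by j steps. */
        MPI_Cart_shift(cart, 1, -coords[0], &src, &dst);
        MPI_Sendrecv_replace(a, nb * nb, MPI_DOUBLE, dst, 1, src, 1,
                             cart, &status);
        MPI_Cart_shift(cart, 0, -coords[1], &src, &dst);
        MPI_Sendrecv_replace(b, nb * nb, MPI_DOUBLE, dst, 1, src, 1,
                             cart, &status);

        /* Compute-and-shift phase: sq multiplications, each followed by
           a single-step shift of A (left) and B (up), with wraparound. */
        for (int step = 0; step < sq; step++) {
            block_mult(nb, a, b, c);
            MPI_Cart_shift(cart, 1, -1, &src, &dst);   /* A one step left */
            MPI_Sendrecv_replace(a, nb * nb, MPI_DOUBLE, dst, 1, src, 1,
                                 cart, &status);
            MPI_Cart_shift(cart, 0, -1, &src, &dst);   /* B one step up */
            MPI_Sendrecv_replace(b, nb * nb, MPI_DOUBLE, dst, 1, src, 1,
                                 cart, &status);
        }

        /* Undo the alignment so A and B return to their original owners. */
        MPI_Cart_shift(cart, 1, coords[0], &src, &dst);
        MPI_Sendrecv_replace(a, nb * nb, MPI_DOUBLE, dst, 1, src, 1,
                             cart, &status);
        MPI_Cart_shift(cart, 0, coords[1], &src, &dst);
        MPI_Sendrecv_replace(b, nb * nb, MPI_DOUBLE, dst, 1, src, 1,
                             cart, &status);
    }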
17 Overlapping Communication with Computation
- Our MPI programs so far used blocking send/receive operations to perform point-to-point communication.
- As discussed earlier,
- a blocking send operation remains blocked until the message has been copied out of the send buffer
- a blocking receive operation returns only after the message has been received and copied into the receive buffer.
- In Cannon's algorithm, for example, each process blocks on MPI_Sendrecv_replace until the specified matrix block has been sent to and received from the corresponding processes.
- Note that the blocks of matrices A and B do not change as they are shifted among the processors.
- Thus, we can overlap the transmission of these blocks with the computation for the matrix-matrix multiplication.
- Many recent distributed-memory parallel computers have dedicated communication controllers that can perform the transmission of messages without interrupting the CPUs.
18 Non-Blocking Communication Operations
- In order to overlap communication with computation, MPI provides a pair of functions for performing non-blocking send and receive operations:

    int MPI_Isend(void *buf, int count, MPI_Datatype datatype,
                  int dest, int tag, MPI_Comm comm,
                  MPI_Request *request)

    int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
                  int source, int tag, MPI_Comm comm,
                  MPI_Request *request)

- These operations return before the communication has actually completed.
- Function MPI_Test tests whether or not the non-blocking send or receive operation identified by its request has finished:

    int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

- MPI_Wait waits for the operation to complete (an example is sketched below):

    int MPI_Wait(MPI_Request *request, MPI_Status *status)
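- A minimal example (illustrative) that overlaps a ring exchange with local work using MPI_Isend/MPI_Irecv and completes it with MPI_Wait:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int right = (rank + 1) % p, left = (rank - 1 + p) % p;
        int sendval = rank, recvval = -1;
        MPI_Request reqs[2];

        /* Both calls return immediately; the transfers proceed in the
           background. */
        MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recvval, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... computation that touches neither sendval nor recvval ... */

        /* Complete both operations before using the buffers. */
        MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
        MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);
        printf("rank %d received %d from rank %d\n", rank, recvval, left);

        MPI_Finalize();
        return 0;
    }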
19 Cannon's Algorithm Using Non-Blocking Operations
- A parallel program for Cannon's algorithm using non-blocking operations follows the same structure as the blocking version (a sketch of its main loop is given after this list).
- Two main differences between this program and the earlier one using blocking operations:
- Additional arrays, a_buffers and b_buffers, are used for the blocks of A and B that are being received while the computation involving the previous blocks is performed.
- In the main computational loop, the program first starts the non-blocking send operations to send the locally stored blocks of A and B to the processes to the left and up in the grid, and then starts the non-blocking receive operations to receive the blocks for the next iteration from the processes to the right and down in the grid.
- After starting these four non-blocking operations, it proceeds to perform the matrix-matrix multiplication of the blocks it currently stores.
- Finally, before it proceeds to the next iteration, it uses MPI_Wait to wait for the send and receive operations to complete.
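- A hedged sketch of that main computational loop (assumptions: alignment already performed, neighbor ranks obtained via MPI_Cart_shift, c zero-initialized, and block_mult as in the earlier blocking sketch; all names are illustrative):

    #include <mpi.h>

    void block_mult(int nb, const double *a, const double *b, double *c);

    void cannon_nonblocking_loop(int nb, int sq, double *a_buffers[2],
                                 double *b_buffers[2], double *c,
                                 int leftrank, int rightrank,
                                 int uprank, int downrank, MPI_Comm cart)
    {
        MPI_Request reqs[4];
        int cur = 0;   /* index of the blocks currently being computed with */

        for (int step = 0; step < sq; step++) {
            /* Start sending the current blocks of A (left) and B (up),
               and receiving the next blocks from the right and below. */
            MPI_Isend(a_buffers[cur], nb * nb, MPI_DOUBLE, leftrank, 1,
                      cart, &reqs[0]);
            MPI_Isend(b_buffers[cur], nb * nb, MPI_DOUBLE, uprank, 1,
                      cart, &reqs[1]);
            MPI_Irecv(a_buffers[1 - cur], nb * nb, MPI_DOUBLE, rightrank, 1,
                      cart, &reqs[2]);
            MPI_Irecv(b_buffers[1 - cur], nb * nb, MPI_DOUBLE, downrank, 1,
                      cart, &reqs[3]);

            /* Overlap: multiply the blocks currently held while the four
               transfers proceed in the background. */
            block_mult(nb, a_buffers[cur], b_buffers[cur], c);

            /* Complete all four operations before the next iteration. */
            MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
            cur = 1 - cur;   /* swap compute and receive buffers */
        }
    }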