Topic Overview - PowerPoint PPT Presentation

Slides: 20
Provided by: Gunnu4
1
Topic Overview
  • Matrix-Matrix Multiplication
  • Block Matrix Operations
  • A Simple Parallel Matrix-Matrix Multiplication
  • Cannon's Algorithm
  • Overlapping Communication with Computation

2
Matrix-Matrix Multiplication
  • Building on our matrix-vector multiplication
    (Quinn's Chapter 8), we now consider
    matrix-matrix multiplication:
  • multiplying two n x n dense, square matrices A
    and B to yield the product matrix C = A x B.
  • For simplicity, we use the conventional serial
    algorithm, which performs n^3 multiply-add
    operations.
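The serial listing from the slide did not survive in this transcript. As a stand-in, here is a minimal sketch of the conventional triple-loop algorithm, assuming row-major storage of doubles; the function name matmul_serial is chosen for this sketch and does not come from the slides.

```c
#include <stddef.h>

/* Serial product C = A x B for n x n row-major matrices of doubles.
 * The innermost statement runs n^3 times. */
void matmul_serial(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Calling matmul_serial(2, a, b, c) with a = {1, 2, 3, 4} and b = {5, 6, 7, 8} fills c with the 2 x 2 product.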

3
Block Matrix Operations
  • Matrix computations involving scalar algebraic
    operations on the matrix elements can be
    expressed in terms of identical operations on
    submatrices of the original matrix.
  • Such algebraic operations on the submatrices are
    called block matrix operations.
  • Block operations are useful in matrix
    multiplication as well as in a variety of other
    matrix algorithms.
  • In this view, an n x n matrix A can be regarded
    as a q x q array of blocks Ai,j (0 ≤ i, j < q)
    such that each block is an (n/q) x (n/q)
    submatrix.
  • We perform q^3 block matrix multiplications, each
    involving (n/q) x (n/q) matrices
  • and each requiring (n/q)^3 additions and
    multiplications.
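The block view can be sketched directly in code. The version below is a minimal illustration, assuming row-major storage and that q divides n; matmul_blocked is a name made up for this sketch. The three outer loops enumerate the q^3 block products, and the three inner loops perform one (n/q) x (n/q) block product.

```c
#include <stddef.h>

/* Blocked product C = A x B for n x n row-major matrices, viewed as a
 * q x q array of (n/q) x (n/q) blocks. Assumes q divides n. */
void matmul_blocked(size_t n, size_t q,
                    const double *a, const double *b, double *c)
{
    size_t bs = n / q;                       /* block edge length n/q */

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    /* q^3 block products: C(bi,bj) += A(bi,bk) x B(bk,bj) */
    for (size_t bi = 0; bi < q; bi++)
        for (size_t bj = 0; bj < q; bj++)
            for (size_t bk = 0; bk < q; bk++)
                /* one block product: (n/q)^3 multiply-adds */
                for (size_t i = bi * bs; i < (bi + 1) * bs; i++)
                    for (size_t j = bj * bs; j < (bj + 1) * bs; j++) {
                        double sum = c[i * n + j];
                        for (size_t k = bk * bs; k < (bk + 1) * bs; k++)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
}
```

The total work is still q^3 x (n/q)^3 = n^3 multiply-adds; only the order of the operations changes.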

4
Block Matrix Operations
5
A Simple Parallel Matrix-Matrix Multiplication
Algorithm
  • Consider two n x n matrices A and B partitioned
    into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of
    size (n/√p) x (n/√p) each.
  • Process Pi,j initially stores Ai,j and Bi,j and
    computes block Ci,j of the result matrix.
  • Computing submatrix Ci,j requires all submatrices
    Ai,k and Bk,j for 0 ≤ k < √p.
  • All-to-all broadcast blocks of A along rows and B
    along columns.
  • Perform local submatrix multiplication.
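Written out, the local computation on process Pi,j after the two broadcasts is the block form of the usual inner product. The formula below restates the bullets above, with the p processes arranged as a √p x √p grid:

```latex
C_{i,j} \;=\; \sum_{k=0}^{\sqrt{p}-1} A_{i,k}\, B_{k,j},
\qquad 0 \le i, j < \sqrt{p}.
```

Each term is a product of two (n/√p) x (n/√p) blocks, so every process performs √p block multiplications.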

6
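Matrix-Matrix Multiplication: Performance Analysis
  • The two broadcasts take time
    2(t_s log(√p) + t_w (n^2/p)(√p - 1)), where t_s
    is the message startup time and t_w the per-word
    transfer time.
  • The computation requires √p multiplications of
    (n/√p) x (n/√p) sized submatrices.
  • The parallel run time is approximately
    T_P = n^3/p + t_s log p + 2 t_w n^2/√p.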
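The run time can be assembled term by term. The derivation below is a sketch under stated assumptions: t_s is the message startup time, t_w the per-word transfer time, one multiply-add costs unit time, and an all-to-all broadcast of m words among √p processes costs t_s log √p + t_w m (√p - 1). These symbols follow the common textbook cost model and are assumptions here.

```latex
\begin{align*}
T_{\text{comp}} &= \sqrt{p}\,\Bigl(\frac{n}{\sqrt{p}}\Bigr)^{3} = \frac{n^{3}}{p},\\
T_{\text{comm}} &= 2\Bigl(t_s \log\sqrt{p} + t_w\,\frac{n^{2}}{p}\,(\sqrt{p}-1)\Bigr)
                 \approx t_s \log p + 2\,t_w\,\frac{n^{2}}{\sqrt{p}},\\
T_P &\approx \frac{n^{3}}{p} + t_s \log p + 2\,t_w\,\frac{n^{2}}{\sqrt{p}}.
\end{align*}
```

The factor 2 covers the two broadcasts (A along rows, B along columns), and 2 log √p = log p.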

7
Drawback of the Simple Parallel Algorithm
  • A major drawback of this algorithm is that it is
    not memory optimal
  • Each process has √p blocks of both matrices
    A and B at the end of each communication phase
  • Thus, each process requires Θ(n^2/√p)
    memory,
  • since each block requires Θ(n^2/p) memory.
  • The total memory requirement over all the
    processes is Θ(n^2 x √p),
  • i.e., √p times the memory requirement of
    the sequential algorithm.

8
Matrix-Matrix Multiplication: Cannon's Algorithm
  • Cannon's algorithm is a memory-efficient version
    of the simple parallel algorithm
  • with a total memory requirement of Θ(n^2).
  • Matrices A and B are partitioned into p square
    blocks as in the simple parallel algorithm.
  • Although every process in the ith row requires
    all √p submatrices Ai,k (0 ≤ k < √p), the
    all-to-all broadcast can be avoided by
  • scheduling the computations of the
    processes of the ith row such that, at any given
    time, each process is using a different block
    Ai,k.
  • systematically rotating these blocks among the
    processes after every submatrix multiplication so
    that every process gets a fresh Ai,k after each
    rotation.
  • If an identical schedule is applied to the
    columns of B, then no process holds more than one
    block of each matrix at any time
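To put numbers on the memory difference, the sketch below counts the matrix words each process holds for the input blocks only (the C block is left out). It assumes a q x q process grid with q = √p and q dividing n; the function names are illustrative.

```c
#include <stddef.h>

/* Words in one (n/q) x (n/q) block on a q x q process grid (p = q*q). */
size_t block_words(size_t n, size_t q)
{
    return (n / q) * (n / q);
}

/* Simple algorithm: q blocks of A plus q blocks of B after the broadcasts. */
size_t simple_words_per_process(size_t n, size_t q)
{
    return 2 * q * block_words(n, q);
}

/* Cannon's algorithm: one block of A and one block of B at any time. */
size_t cannon_words_per_process(size_t n, size_t q)
{
    return 2 * block_words(n, q);
}
```

For n = 1024 on 16 processes (q = 4), the simple algorithm holds four times as many input words per process as Cannon's algorithm, which is the √p factor from the analysis.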

9
Communication Steps in Cannon's Algorithm
10
Communication Steps in Cannon's Algorithm
  • First, align the blocks of A and B in such a way
    that each process can multiply its local
    submatrices:
  • shift submatrices Ai,j to the left (with
    wraparound) by i steps
  • shift submatrices Bi,j up (with wraparound) by j
    steps.
  • After alignment (Figure 8.3c),
  • process Pi,j has submatrices Ai,(j+i) mod √p
    and B(i+j) mod √p,j.
  • Perform local block matrix multiplication.
  • Next, each block of A moves one step left and
    each block of B moves one step up (again with
    wraparound).
  • Perform the next block multiplication, add to the
    partial result, and repeat until all the √p blocks
    have been multiplied.
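The communication steps can be checked without a parallel machine. The following is a serial simulation only, not an MPI program: "process" (i,j) of a q x q grid owns one bs x bs block of A, B and C, stored contiguously at offset (i*q + j) * bs*bs. All function names are hypothetical, and malloc failure is not handled in this sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Rotate the blocks in grid row i left by `steps`, with wraparound. */
static void shift_row_left(double *m, int q, int bsz, int i, int steps)
{
    double *tmp = malloc((size_t)q * bsz * sizeof *tmp);
    for (int j = 0; j < q; j++)     /* new (i,j) <- old (i,(j+steps) mod q) */
        memcpy(tmp + (size_t)j * bsz,
               m + ((size_t)i * q + (j + steps) % q) * bsz,
               (size_t)bsz * sizeof *tmp);
    memcpy(m + (size_t)i * q * bsz, tmp, (size_t)q * bsz * sizeof *tmp);
    free(tmp);
}

/* Rotate the blocks in grid column j up by `steps`, with wraparound. */
static void shift_col_up(double *m, int q, int bsz, int j, int steps)
{
    double *tmp = malloc((size_t)q * bsz * sizeof *tmp);
    for (int i = 0; i < q; i++)     /* new (i,j) <- old ((i+steps) mod q,j) */
        memcpy(tmp + (size_t)i * bsz,
               m + ((size_t)((i + steps) % q) * q + j) * bsz,
               (size_t)bsz * sizeof *tmp);
    for (int i = 0; i < q; i++)
        memcpy(m + ((size_t)i * q + j) * bsz, tmp + (size_t)i * bsz,
               (size_t)bsz * sizeof *tmp);
    free(tmp);
}

/* Simulated Cannon's algorithm. a and b are overwritten by the shifts. */
void cannon_simulate(int q, int bs, double *a, double *b, double *c)
{
    int bsz = bs * bs;              /* words per block */
    for (int x = 0; x < q * q * bsz; x++)
        c[x] = 0.0;

    /* Alignment: row i of A shifts left by i, column j of B up by j. */
    for (int i = 0; i < q; i++) shift_row_left(a, q, bsz, i, i);
    for (int j = 0; j < q; j++) shift_col_up(b, q, bsz, j, j);

    for (int step = 0; step < q; step++) {
        /* Every process multiplies its current blocks into its C block. */
        for (int p = 0; p < q * q; p++) {
            const double *ab = a + (size_t)p * bsz;
            const double *bb = b + (size_t)p * bsz;
            double *cb = c + (size_t)p * bsz;
            for (int r = 0; r < bs; r++)
                for (int s = 0; s < bs; s++)
                    for (int k = 0; k < bs; k++)
                        cb[r * bs + s] += ab[r * bs + k] * bb[k * bs + s];
        }
        /* Single-step shifts: A one step left, B one step up. */
        for (int i = 0; i < q; i++) shift_row_left(a, q, bsz, i, 1);
        for (int j = 0; j < q; j++) shift_col_up(b, q, bsz, j, 1);
    }
}
```

With bs = 1 the block layout coincides with ordinary row-major storage, which makes small cases easy to compare against the serial product.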

11
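Cannon's Algorithm: An Example
  • Consider the matrices to be multiplied
    [figure: matrices A and B]
  • Assume that these matrices are partitioned into 4
    square blocks as follows
    [figure: blocks A0,0 to A1,1 and B0,0 to B1,1]
  • After the initial alignment, matrices A and B
    become
    [figure: aligned block layout of A and B]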

12
Cannon's Algorithm: An Example
  • After this alignment, process
  • P0,0 ends up with A0,0 and B0,0 and should
    compute C0,0
  • P0,1 ends up with A0,1 and B1,1 and should
    compute C0,1
  • P1,0 ends up with A1,1 and B1,0 and should
    compute C1,0
  • P1,1 ends up with A1,0 and B0,1 and should
    compute C1,1
  • The local block matrix multiplications proceed as
    follows
  • C0,0 = A0,0 x B0,0
  • C0,1 = A0,1 x B1,1
  • C1,0 = A1,1 x B1,0
  • C1,1 = A1,0 x B0,1

13
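Cannon's Algorithm: An Example
  • Shift 1: shift each block of A one step to the
    left and shift each block of B one step up
  • P0,0 now holds A0,1 and B1,0, P0,1 holds A0,0 and
    B0,1, P1,0 holds A1,0 and B0,0, and P1,1 holds
    A1,1 and B1,1
  • Next, each process Pi,j performs block
    multiplication, updating Ci,j
  • C0,0 = C0,0 + A0,1 x B1,0
  • C0,1 = C0,1 + A0,0 x B0,1
  • C1,0 = C1,0 + A1,0 x B0,0
  • C1,1 = C1,1 + A1,1 x B1,1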
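The pattern in the 2 x 2 example generalizes: after alignment, process Pi,j holds Ai,k and Bk,j with k = (i + j) mod q, and each single-step shift advances k by one. The small sketch below (names are illustrative, q is the grid edge √p) computes that index and prints the multiplication schedule.

```c
#include <stdio.h>

/* Index k of the blocks A(i,k) and B(k,j) that process (i,j) multiplies
 * at a given step (step 0 is right after alignment) on a q x q grid. */
int cannon_block_index(int q, int i, int j, int step)
{
    return (i + j + step) % q;
}

/* Print, for each step, the block product every process performs. */
void print_schedule(int q)
{
    for (int step = 0; step < q; step++) {
        printf("step %d\n", step);
        for (int i = 0; i < q; i++)
            for (int j = 0; j < q; j++) {
                int k = cannon_block_index(q, i, j, step);
                printf("  P%d,%d: C%d,%d += A%d,%d x B%d,%d\n",
                       i, j, i, j, i, k, k, j);
            }
    }
}
```

Calling print_schedule(2) lists the same block products as the two steps of the example; over q steps each process sees every k exactly once.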

14
Cannon's Algorithm: Performance Analysis
  • In the alignment step, the maximum distance over
    which a block shifts is √p - 1,
  • the two shift operations require a total of
    2(t_s + t_w n^2/p) time, with startup time t_s
    and per-word transfer time t_w.
  • Each of the √p single-step shifts in the
    compute-and-shift phase of the algorithm takes
    t_s + t_w n^2/p time.
  • The computation time for multiplying √p
    matrices of size (n/√p) x (n/√p) is n^3/p.
  • The parallel time is approximately
    T_P = n^3/p + 2 √p t_s + 2 t_w n^2/√p.
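The terms of the parallel time can be collected as follows. This is a sketch under assumptions: t_s is the message startup time, t_w the per-word transfer time, one multiply-add costs unit time, and the cost of a shift does not depend on the distance travelled (as with cut-through routing).

```latex
\begin{align*}
T_{\text{align}} &= 2\Bigl(t_s + t_w\,\frac{n^{2}}{p}\Bigr),\\
T_{\text{shift}} &= 2\sqrt{p}\,\Bigl(t_s + t_w\,\frac{n^{2}}{p}\Bigr)
                  = 2\sqrt{p}\,t_s + 2\,t_w\,\frac{n^{2}}{\sqrt{p}},\\
T_{\text{comp}}  &= \sqrt{p}\,\Bigl(\frac{n}{\sqrt{p}}\Bigr)^{3} = \frac{n^{3}}{p},\\
T_P &\approx \frac{n^{3}}{p} + 2\sqrt{p}\,t_s + 2\,t_w\,\frac{n^{2}}{\sqrt{p}}.
\end{align*}
```

The alignment cost is a lower-order term and is dropped in the approximation; the factor 2 in the shift term accounts for shifting both A and B at each of the √p steps.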

15
MPI_Cart_shift Function
  • Shifting data along the dimensions of the 2-D
    mesh is a frequent operation in Cannon's
    algorithm
  • MPI provides the function MPI_Cart_shift for this
    purpose.
  • int MPI_Cart_shift(
  • MPI_Comm comm_cart, /* communicator with
    Cartesian structure (handle) */
  • int dir, /* coordinate dimension of the shift */
  • int s_step, /* shift size/displacement (> 0: up
    shift, < 0: down shift) */
  • int *rank_source, /* rank of source process
    (output) */
  • int *rank_dest) /* rank of destination
    process (output) */
  • The function only computes the two ranks; the data
    itself is moved by a separate call such as
    MPI_Sendrecv_replace.
  • Here is an example program exercising this
    function.
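The example program referenced on the slide is not part of this transcript. As a stand-in that runs without an MPI installation, the sketch below mimics the rank arithmetic such a shift yields on a periodic q x q grid with row-major ranks (rank = row * q + col). It makes no MPI calls, and grid_shift is a hypothetical helper, not an MPI function.

```c
/* Wrap x into the range 0..q-1, also for negative x. */
static int wrap(int x, int q)
{
    return ((x % q) + q) % q;
}

/* For the process `rank` on a periodic q x q grid, compute the rank it
 * would receive from (source) and send to (dest) when shifting by `disp`
 * along dimension `dim` (0 = row index, 1 = column index). */
void grid_shift(int q, int rank, int dim, int disp,
                int *rank_source, int *rank_dest)
{
    int coords[2] = { rank / q, rank % q };
    int src[2] = { coords[0], coords[1] };
    int dst[2] = { coords[0], coords[1] };

    src[dim] = wrap(coords[dim] - disp, q);
    dst[dim] = wrap(coords[dim] + disp, q);

    *rank_source = src[0] * q + src[1];
    *rank_dest = dst[0] * q + dst[1];
}
```

In this convention a left shift by one step in Cannon's algorithm is dim = 1 with disp = -1: the destination is the left neighbor and the source is the right neighbor, both with wraparound.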

16
Sending and Receiving Messages Simultaneously
  • To exchange messages, MPI provides the following
    function
  • int MPI_Sendrecv(void *sendbuf, int sendcount,
  • MPI_Datatype senddatatype, int dest, int sendtag,
  • void *recvbuf, int recvcount,
  • MPI_Datatype recvdatatype, int source,
  • int recvtag, MPI_Comm comm, MPI_Status *status)
  • The argument list combines the arguments of the
    send and receive functions.
  • If we wish to use the same buffer for both send
    and receive, we can use
  • int MPI_Sendrecv_replace(void *buf, int count,
  • MPI_Datatype datatype, int dest,
  • int sendtag, int source, int recvtag,
  • MPI_Comm comm, MPI_Status *status)
  • A parallel program for Cannon's algorithm is here.

17
Overlapping Communication with Computation
  • Our MPI programs so far used blocking
    send/receive operations to perform point-to-point
    communication.
  • As discussed earlier,
  • a blocking send operation remains blocked until
    the message has been copied out of the send
    buffer
  • a blocking receive operation returns only after
    the message has been received and copied into the
    receive buffer.
  • In Cannon's algorithm, for example, each
    process blocks on MPI_Sendrecv_replace
  • until the specified matrix block has been sent
    and received by the corresponding processes.
  • Note that the blocks of matrices A and B do not
    change as they are shifted among the processors
  • Thus, we can overlap the transmission of these
    blocks with the computation for the matrix-matrix
    multiplication
  • Many recent distributed-memory parallel computers
    have dedicated communication controllers that can
    perform the transmission of messages without
    interrupting the CPUs.

18
Non-Blocking Communication Operations
  • In order to overlap communication with
    computation, MPI provides a pair of functions for
    performing non-blocking send and receive
    operations.
  • int MPI_Isend(void *buf, int count, MPI_Datatype
    datatype,
  • int dest, int tag, MPI_Comm comm, MPI_Request
    *request)
  • int MPI_Irecv(void *buf, int count, MPI_Datatype
    datatype,
  • int source, int tag, MPI_Comm comm,
    MPI_Request *request)
  • These operations return before the operations
    have been completed.
  • Function MPI_Test tests whether or not the
    non-blocking send or receive operation identified
    by its request has finished.
  • int MPI_Test(MPI_Request *request, int *flag,
    MPI_Status *status)
  • MPI_Wait waits for the operation to complete. An
    example is here.
  • int MPI_Wait(MPI_Request *request, MPI_Status
    *status)

19
Cannon's Algorithm using Non-Blocking Operations
  • Here is the parallel program for Cannon's
    algorithm using non-blocking operations
  • There are two main differences between this
    program and the earlier one using blocking
    operations:
  • Additional arrays, a_buffers and b_buffers, are
    used for the blocks of A and B that are being
    received while the computation involving the
    previous blocks is performed.
  • In the main computational loop, it first starts
    the non-blocking send operations to send the
    locally stored blocks of A and B to the processes
    left and up the grid, and then starts the
    non-blocking receive operations to receive the
    blocks for the next iteration from the processes
    right and down the grid.
  • After starting these four non-blocking
    operations, it proceeds to perform the
    matrix-matrix multiplication of the blocks it
    currently stores.
  • Finally, before it proceeds to the next
    iteration, it uses MPI_Wait to wait for the send
    and receive operations to complete.
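The buffer bookkeeping described above can be illustrated without MPI. The sketch below is a serial simulation with 1 x 1 blocks, not the program from the slides: two buffers per matrix play the role of a_buffers and b_buffers, a plain copy stands in for the MPI_Isend/MPI_Irecv pair, and the index swap stands in for MPI_Wait followed by switching buffers. The function name is hypothetical and malloc failure is not handled.

```c
#include <stdlib.h>

/* Double-buffered Cannon schedule on a q x q grid with 1 x 1 blocks.
 * a and b are q x q row-major inputs; c receives the product. */
void cannon_double_buffered(int q, const double *a, const double *b,
                            double *c)
{
    int n = q * q;
    double *a_buf[2], *b_buf[2];
    for (int t = 0; t < 2; t++) {
        a_buf[t] = malloc((size_t)n * sizeof(double));
        b_buf[t] = malloc((size_t)n * sizeof(double));
    }

    /* Initial alignment, written straight into buffer 0. */
    for (int i = 0; i < q; i++)
        for (int j = 0; j < q; j++) {
            int k = (i + j) % q;
            a_buf[0][i * q + j] = a[i * q + k];
            b_buf[0][i * q + j] = b[k * q + j];
            c[i * q + j] = 0.0;
        }

    int cur = 0;
    for (int step = 0; step < q; step++) {
        int nxt = 1 - cur;
        for (int i = 0; i < q; i++)
            for (int j = 0; j < q; j++) {
                int me = i * q + j;
                int right = i * q + (j + 1) % q;     /* sends next A block */
                int down = ((i + 1) % q) * q + j;    /* sends next B block */

                /* "Receive" the next blocks into the spare buffers. */
                a_buf[nxt][me] = a_buf[cur][right];
                b_buf[nxt][me] = b_buf[cur][down];

                /* Compute on the current buffers; the transfer only
                 * reads them, so the two can overlap. */
                c[me] += a_buf[cur][me] * b_buf[cur][me];
            }
        cur = nxt;      /* transfers done: switch to the new blocks */
    }

    for (int t = 0; t < 2; t++) {
        free(a_buf[t]);
        free(b_buf[t]);
    }
}
```

The design point is that the current buffers are never written during a step, which is what allows the transfer and the multiplication to proceed at the same time in a real non-blocking implementation.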