Matrix Vector Multiplication Summary

Transcript and Presenter's Notes

Title: Matrix Vector Multiplication Summary


1
Matrix Vector Multiplication (Summary)
2
Parallel Matrix Vector Multiplication
  • We want to multiply a matrix by a vector
  • If the matrix has m rows and n columns, then the
    vector must have n elements
  • Parallelization
  • We can divide the vector
  • Into vectors of smaller sizes
  • Or replicate it on each processor (vectors require
    less memory than the matrix)
  • We can divide the matrix
  • Along the rows
  • Along the columns
  • Into small blocks

Dividing the data among processors
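As a point of reference for the complexity figures later in the talk, here is a minimal sequential version in C (a sketch, not from the slides; the 2x3 example values are made up):

#include <stdio.h>

/* y = A*x for an m-by-n matrix A stored in row-major order:
   one length-n inner product per row, Theta(mn) work overall. */
static void matvec(int m, int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}

int main(void)
{
    double A[2][3] = {{9, 7, 2}, {5, 6, 1}};  /* m = 2 rows, n = 3 columns */
    double x[3] = {1, 0, 2};                  /* the vector has n elements */
    double y[2];
    matvec(2, 3, &A[0][0], x, y);
    printf("y = (%g, %g)\n", y[0], y[1]);     /* prints y = (13, 7) */
    return 0;
}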
3
Three Algorithms
  • Decompose matrix along rows and replicate the
    vector
  • Decompose matrix along columns and divide the
    vector
  • Decompose matrix in blocks and divide the vector

NOTE: It makes sense to divide up the vector when
the matrix has fewer rows
4
Analyzing a Parallel Algorithm
  • Partitioning
  • Communication
  • Agglomeration and Mapping
  • Computational Complexity
  • Communicational Complexity
  • Scalability

Algorithmic Characteristics
Performance Evaluation
5
Method 1: Matrix Divided by Rows, Vector Replicated
6
MPI_Allgatherv
7
Header for MPI_Allgatherv
int MPI_Allgatherv (void *send_buffer, int send_cnt,
                    MPI_Datatype send_type,
                    void *receive_buffer, int *receive_cnt,
                    int *receive_disp,
                    MPI_Datatype receive_type,
                    MPI_Comm communicator)
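For the row-wise algorithm, a minimal usage sketch (an assumption about the surrounding code: each process has already computed local_y, its n/p entries of the result, and recv_cnt/recv_disp are the usual count/displacement arrays):

#include <mpi.h>

/* Every process contributes its block of the result vector and receives
   the complete vector y; no root is needed because all processes get it. */
void allgather_result(double *local_y, int local_n,
                      double *y, int *recv_cnt, int *recv_disp,
                      MPI_Comm comm)
{
    MPI_Allgatherv(local_y, local_n, MPI_DOUBLE,
                   y, recv_cnt, recv_disp, MPI_DOUBLE, comm);
}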
8
Agglomeration and Mapping
  • Static number of tasks
  • Regular communication pattern (all-gather)
  • Computation time per task is constant
  • Strategy
  • Agglomerate groups of rows
  • Create one task per MPI process

9
Complexity Analysis
  • Sequential algorithm complexity: Θ(n²)
  • Parallel algorithm computational complexity:
    Θ(n²/p)
  • Each processor has n/p rows of width n
  • Each inner product takes n operations, so the
    n/p rows take n·(n/p)
  • Communication complexity of all-gather:
    Θ(log p + n)
  • Gather takes log(p) steps
  • Each processor contributes n/p elements
  • which reach the other p-1 processors
  • Total: log(p)·latency + (n/p)·(p-1)/bandwidth
  • Overall complexity: Θ(n²/p + log p)
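Writing λ for the per-message latency and β for the bandwidth (symbols assumed here; the slides name the quantities but not the notation), the all-gather cost just itemized can be stated as:

\[
T_{\text{allgather}} \;=\; \lambda \lceil \log p \rceil \;+\; \frac{n}{p}\cdot\frac{p-1}{\beta} \;=\; \Theta(\log p + n)
\]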

10
Isoefficiency Analysis
  • Sequential time complexity: Θ(n²)
  • Parallel communication is all-gather
  • Communication complexity:
  • log(p)·latency + (n/p)·(p-1)/bandwidth
  • When n is large, message transmission time
    dominates message latency
  • Parallel communication time: Θ(n)
  • Isoefficiency condition T(1,n) ≥ C·T0(n,p):
  • n² ≥ Cpn ⇒ n ≥ Cp, with M(n) = n²
  • System is not highly scalable
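The same numbers give the scalability function directly (a sketch of the standard argument, using the slides' M(n) = n² and writing T0 for the total overhead over all p processes):

\[
T_0(n,p) = \Theta(pn), \qquad n^2 \ge C\,p\,n \;\Longrightarrow\; n \ge Cp,
\qquad \frac{M(Cp)}{p} = \frac{C^2 p^2}{p} = C^2 p
\]

Memory per processor must grow linearly with p to hold efficiency constant, which is why the system is not highly scalable.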

11
Method 2: Matrix Divided by Columns, Vector
Decomposed
[Figure: a 4x4 example matrix with rows (2 0 6 1), (1 0 0 1), (0 0 2 3), (2 0 2 3), its columns distributed among processes along with the elements of the vector (1, 0, 1, 2)]
12
Getting Columns of the Matrix
  • In memory, the matrix is stored as rows
    (row-major order)
  • Let one process handle the I/O
  • It reads the matrix and distributes it to the
    other processes as columns

13
MPI_Scatterv
14
Header for MPI_Scatterv
int MPI_Scatterv (void *send_buffer, int *send_cnt,
                  int *send_disp, MPI_Datatype send_type,
                  void *receive_buffer, int receive_cnt,
                  MPI_Datatype receive_type,
                  int root, MPI_Comm communicator)
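A minimal sketch of the call for the decomposed vector b (the names are illustrative; send_cnt and send_disp are only significant at the root):

#include <mpi.h>

/* The root holds the full vector b; each process receives its own
   local_n-element piece into local_b. */
void scatter_vector(double *b, int *send_cnt, int *send_disp,
                    double *local_b, int local_n,
                    int root, MPI_Comm comm)
{
    MPI_Scatterv(b, send_cnt, send_disp, MPI_DOUBLE,
                 local_b, local_n, MPI_DOUBLE, root, comm);
}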
15
Communication
  • After calculating its inner products, each
    processor must deliver the pieces of its partial
    result vector to the processes that need them

[Figure: the four processes' partial result vectors being exchanged all-to-all]
16
Function MPI_Alltoallv
17
Header for MPI_Alltoallv
int MPI_Alltoallv (void *send_buffer, int *send_cnt,
                   int *send_disp, MPI_Datatype send_type,
                   void *receive_buffer, int *receive_cnt,
                   int *receive_disp,
                   MPI_Datatype receive_type,
                   MPI_Comm communicator)
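A usage sketch for the exchange described on the Communication slide (all four count/displacement arrays are assumptions about how the partial vectors are laid out):

#include <mpi.h>

/* Each process sends piece j of its partial result to process j and
   receives the p pieces it must sum to form its part of the result. */
void exchange_partials(double *partial, int *send_cnt, int *send_disp,
                       double *recv_buf, int *recv_cnt, int *recv_disp,
                       MPI_Comm comm)
{
    MPI_Alltoallv(partial, send_cnt, send_disp, MPI_DOUBLE,
                  recv_buf, recv_cnt, recv_disp, MPI_DOUBLE, comm);
}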
18
Agglomeration and Mapping
  • Static number of tasks
  • Regular communication pattern (all-to-all)
  • Computation time per task is constant
  • Strategy
  • Agglomerate groups of columns
  • Create one task per MPI process

19
Complexity Analysis
  • Sequential algorithm complexity: Θ(n²)
  • Parallel algorithm computational complexity:
    Θ(n²/p)
  • Each processor has n/p columns
  • Each column has n elements
  • Communication complexity: Θ(p + n + log(p))
  • Scatter takes log(p) steps and sends n/p elements
    to the (p-1) other processors
  • log(p)·latency + (n/p)·(p-1)/bandwidth
  • All-to-all
  • Option 1: Each process sends a message to the
    rest, sending each
  • destination only the part it requires
  • Number of messages: (p-1)
  • Total amount sent: O(n)
  • Complexity: (p-1)·latency + n/bandwidth

Total: Θ(p + n²/p)
20
Isoefficiency Analysis
  • Sequential time complexity: Θ(n²)
  • Only parallel overhead is the all-to-all
  • When n is large, message transmission time
    dominates message latency
  • Parallel communication time: Θ(n + p + log(p))
  • n² ≥ Cpn ⇒ n ≥ Cp
  • Scalability function same as the rowwise
    algorithm: C²p
21
Printing the Results Vectors
  • To view the vector in order of its indices,
    only one processor should print it
  • This is the opposite of a scatter: a gather
    operation

22
Function MPI_Gatherv
23
Header for MPI_Gatherv
int MPI_Gatherv (void *send_buffer, int send_cnt,
                 MPI_Datatype send_type,
                 void *receive_buffer, int *receive_cnt,
                 int *receive_disp,
                 MPI_Datatype receive_type,
                 int root, MPI_Comm communicator)
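A sketch of the printing step (assuming rank, the count/displacement arrays, and a full-length buffer y on the root already exist; names are illustrative):

#include <mpi.h>
#include <stdio.h>

/* Gather the distributed result onto the root, then print it in index
   order; only the root's receive arguments are significant. */
void gather_and_print(double *local_y, int local_n, double *y,
                      int *recv_cnt, int *recv_disp, int n,
                      int root, int rank, MPI_Comm comm)
{
    MPI_Gatherv(local_y, local_n, MPI_DOUBLE,
                y, recv_cnt, recv_disp, MPI_DOUBLE, root, comm);
    if (rank == root)
        for (int i = 0; i < n; i++)
            printf("%g\n", y[i]);
}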
24
Count/Displacement Arrays
  • Scatter, gather, and all-to-all require two
    pairs of count/displacement arrays
  • First pair describes the values being sent
  • send_cnt: number of elements (always required)
  • send_disp: index of the first element (when
    breaking up an array)
  • Second pair describes the values being received
  • recv_cnt: number of elements (always required)
  • recv_disp: index of the first element (when
    joining into an array)
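One common way to fill these arrays for a block decomposition of n elements over p processes (a sketch; the rounding convention is an assumption):

/* Process i owns indices [i*n/p, (i+1)*n/p); cnt[i] and disp[i] then
   serve as the count/displacement pair for scatter, gather, or
   all-to-all calls. */
void make_cnt_disp(int n, int p, int *cnt, int *disp)
{
    for (int i = 0; i < p; i++) {
        int lo = (int)((long long)i * n / p);
        int hi = (int)((long long)(i + 1) * n / p);
        disp[i] = lo;        /* index of first element */
        cnt[i]  = hi - lo;   /* number of elements */
    }
}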


25
Method 3: Matrix Divided into Cross-Sections, Vector
Decomposed
26
Decomposing the Vector
27
Agglomeration and Mapping
  • Static number of tasks
  • Regular communication pattern (all-to-all)
  • Computation time per task is constant
  • Strategy
  • Agglomerate groups of columns
  • Create one task per MPI process

28
Complexity Analysis
  • Assume p is a square number
  • Each process does its share of the computation:
    Θ(n²/p)
  • Each block is an (n/√p) × (n/√p) piece of the
    matrix
  • Redistributing b: Θ(log(√p)·(n/√p))
  • n/√p elements must be sent
  • Can be done in log(√p) steps (each sends to
    log(√p) processors)
  • Reduction of the partial result vectors:
    Θ(log(√p)·(n/√p))
  • Information is sent log(√p) times
  • n/√p elements have to be added
  • Overall parallel complexity: Θ(n²/p +
    log(√p)·(n/√p))

29
Isoefficiency Analysis
  • Sequential complexity: Θ(n²)
  • Parallel communication complexity:
    Θ(log(√p)·(n/√p)) = Θ((n log p)/√p)
  • Isoefficiency function: n² ≥ C·n·√p·log p ⇒
    n ≥ C·√p·log p
  • This system is much more scalable than the
    previous two implementations
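Carrying the isoefficiency result one step further (a sketch with the same M(n) = n² as before):

\[
\frac{M(C\sqrt{p}\,\log p)}{p} \;=\; \frac{C^2\,p\,\log^2 p}{p} \;=\; C^2 \log^2 p
\]

Memory per processor needs to grow only as log²p, versus linearly in p for the two striped decompositions, which is why this version scales best.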