Title: Matrix Vector Multiplication Summary
1. Matrix-Vector Multiplication (Summary)
2. Parallel Matrix-Vector Multiplication
- We want to multiply a matrix by a vector (a minimal sequential sketch follows this list)
- If the matrix has m rows and n columns, then the vector must have n elements
- Parallelization
  - We can divide the vector
    - Into vectors of smaller sizes
    - Or replicate it on each processor (vectors require less memory than the matrix)
  - We can divide the matrix
    - Along the rows
    - Along the columns
    - Into small blocks
Dividing the data among processors
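As a point of reference, here is a minimal sequential sketch of the operation being parallelized. The names (matvec, A, x, y) are illustrative, not from the slides.

/* Sequential matrix-vector product y = A x, where A is m x n (row-major)
 * and x has n elements. */
void matvec(const double *A, const double *x, double *y, int m, int n)
{
    for (int i = 0; i < m; i++) {          /* one inner product per row */
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
    }
}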
3. Three Algorithms
- Decompose the matrix along rows and replicate the vector
- Decompose the matrix along columns and divide the vector
- Decompose the matrix into blocks and divide the vector
NOTE: It makes sense to divide up the vector when the matrix has fewer rows
4. Analyzing a Parallel Algorithm
- Algorithmic characteristics
  - Partitioning
  - Communication
  - Agglomeration and mapping
- Performance evaluation
  - Computational complexity
  - Communication complexity
  - Scalability
5. Method 1: Matrix Divided by Rows, Vector Replicated
[Figure: example vector (9, 7, 2, 5) replicated on every process]
6. Function MPI_Allgatherv
7. Header for MPI_Allgatherv
int MPI_Allgatherv (
    void         *send_buffer,
    int           send_cnt,
    MPI_Datatype  send_type,
    void         *receive_buffer,
    int          *receive_cnt,
    int          *receive_disp,
    MPI_Datatype  receive_type,
    MPI_Comm      communicator)
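A hedged sketch of how the rowwise method might use this call: each process computes the inner products for its block of rows and then an all-gather assembles the full result on every process. The names (rowwise_matvec, local_A, recv_cnts, recv_disps, and so on) are illustrative assumptions, not from the slides.

#include <mpi.h>
#include <stdlib.h>

/* Rowwise block decomposition with a replicated vector (sketch). */
void rowwise_matvec(double *local_A,   /* local_rows x n block of rows        */
                    double *x,         /* full vector of length n, replicated */
                    double *y,         /* output: full result of length n     */
                    int local_rows, int n,
                    int *recv_cnts, int *recv_disps, /* per-process counts/offsets */
                    MPI_Comm comm)
{
    /* Each process computes the inner products for its own rows. */
    double *local_y = malloc(local_rows * sizeof(double));
    for (int i = 0; i < local_rows; i++) {
        local_y[i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[i] += local_A[i * n + j] * x[j];
    }

    /* All-gather the partial results so every process holds the full y. */
    MPI_Allgatherv(local_y, local_rows, MPI_DOUBLE,
                   y, recv_cnts, recv_disps, MPI_DOUBLE, comm);
    free(local_y);
}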
8. Agglomeration and Mapping
- Static number of tasks
- Regular communication pattern (all-gather)
- Computation time per task is constant
- Strategy
  - Agglomerate groups of rows
  - Create one task per MPI process
9. Complexity Analysis
- Sequential algorithm complexity: Θ(n²)
- Parallel algorithm computational complexity: Θ(n²/p)
  - Each processor has n/p rows of width n
  - Each inner product takes n operations, giving n(n/p) in total
- Communication complexity of the all-gather: Θ(log p + n)
  - The gather takes log(p) steps
  - Each processor holds n/p elements
  - It sends them to p-1 processors
  - Total: log(p)·latency + (n/p)(p-1)/bandwidth
- Overall complexity: Θ(n²/p + log p)
10. Isoefficiency Analysis
- Sequential time complexity: Θ(n²)
- The only parallel communication is the all-gather
- Communication complexity: log(p)·latency + (n/p)(p-1)/bandwidth
- When n is large, message transmission time dominates message latency
- Parallel communication time: Θ(n)
- Isoefficiency relation: T(1,n) ≥ C·T0(n,p)
- n² ≥ Cpn ⇒ n ≥ Cp, with M(n) = n²
- The system is not highly scalable (a compact derivation follows this list)
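A compact sketch of that conclusion, using the standard scalability function M(f(p))/p; the resulting C²p is the value quoted later on slide 20 for the rowwise algorithm.

% Isoefficiency of the rowwise algorithm (sketch)
% T(1,n): sequential time; T0(n,p): total parallel overhead; C: efficiency constant
\begin{align*}
T(1,n) &= \Theta(n^2), \qquad
T_0(n,p) = p \cdot \Theta(n) = \Theta(pn)
  && \text{each of the $p$ processes spends $\Theta(n)$ in the all-gather} \\
T(1,n) &\ge C\,T_0(n,p)
  \;\Longrightarrow\; n^2 \ge C p n
  \;\Longrightarrow\; n \ge C p \\
\frac{M(Cp)}{p} &= \frac{(Cp)^2}{p} = C^2 p
  && \text{memory per processor grows linearly with $p$: not highly scalable}
\end{align*}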
11. Method 2: Matrix Divided by Columns, Vector Decomposed
[Figure: example — the vector (1, 0, 1, 2) is decomposed across the processes, and the columns of the 4×4 matrix
   2 0 6 1
   1 0 0 1
   0 0 2 3
   2 0 2 3
 are distributed among them]
12. Getting Columns of the Matrix
- In memory, matrices are stored as rows
- Let one process handle the I/O
- Read the matrix from memory and distribute it as columns
13. Function MPI_Scatterv
14. Header for MPI_Scatterv
int MPI_Scatterv (
    void         *send_buffer,
    int          *send_cnt,
    int          *send_disp,
    MPI_Datatype  send_type,
    void         *receive_buffer,
    int           receive_cnt,
    MPI_Datatype  receive_type,
    int           root,
    MPI_Comm      communicator)
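One possible way to realize the column distribution with this call, written as a sketch: since C stores the matrix by rows, the root packs the columns contiguously before scattering. All names (scatter_columns, packed, send_cnts, send_disps) are illustrative; a derived datatype would be an alternative to the explicit packing.

#include <mpi.h>
#include <stdlib.h>

/* Distribute columns of a row-major n x n matrix with MPI_Scatterv (sketch). */
void scatter_columns(double *A,          /* full matrix, valid on root only        */
                     double *local_cols, /* output: n x local_n block, column-major */
                     int n, int local_n, /* local_n = columns owned by this process */
                     int *send_cnts, int *send_disps, /* element counts/offsets, root only */
                     int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    double *packed = NULL;
    if (rank == root) {
        /* Pack column j of A into a contiguous run of n elements. */
        packed = malloc((size_t)n * n * sizeof(double));
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                packed[j * n + i] = A[i * n + j];
    }

    /* Each process receives send_cnts[rank] = n * local_n elements. */
    MPI_Scatterv(packed, send_cnts, send_disps, MPI_DOUBLE,
                 local_cols, n * local_n, MPI_DOUBLE, root, comm);

    if (rank == root) free(packed);
}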
15. Communication
- After calculating its inner products, each processor sends its entire partial vector to the other processes
[Figure: partial result vectors being exchanged among the processes]
16. Function MPI_Alltoallv
17. Header for MPI_Alltoallv
int MPI_Alltoallv (
    void         *send_buffer,
    int          *send_cnt,
    int          *send_disp,
    MPI_Datatype  send_type,
    void         *receive_buffer,
    int          *receive_cnt,
    int          *receive_disp,
    MPI_Datatype  receive_type,
    MPI_Comm      communicator)
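A sketch of how the columnwise method might use this call: each process holds a length-n partial vector, sends block j of it to process j, and then sums the p pieces it receives for its own block. Names (exchange_and_reduce, partial, cnts, disps) are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>

/* Redistribute and reduce partial results in the columnwise algorithm (sketch). */
void exchange_and_reduce(double *partial,  /* length n: this process's partial sums   */
                         double *local_y,  /* output: owned block of the result       */
                         int local_n, int p,
                         int *cnts, int *disps, /* cnts[j], disps[j]: size and offset of
                                                   process j's block in the full vector */
                         MPI_Comm comm)
{
    /* Receive p contributions of size local_n, one from every process. */
    double *recv  = malloc((size_t)p * local_n * sizeof(double));
    int *rcnts    = malloc(p * sizeof(int));
    int *rdisps   = malloc(p * sizeof(int));
    for (int j = 0; j < p; j++) { rcnts[j] = local_n; rdisps[j] = j * local_n; }

    /* Send block j of the partial vector to process j. */
    MPI_Alltoallv(partial, cnts, disps, MPI_DOUBLE,
                  recv, rcnts, rdisps, MPI_DOUBLE, comm);

    /* Sum the p partial contributions for each owned element. */
    for (int i = 0; i < local_n; i++) {
        local_y[i] = 0.0;
        for (int j = 0; j < p; j++)
            local_y[i] += recv[j * local_n + i];
    }
    free(recv); free(rcnts); free(rdisps);
}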
18. Agglomeration and Mapping
- Static number of tasks
- Regular communication pattern (all-to-all)
- Computation time per task is constant
- Strategy
  - Agglomerate groups of columns
  - Create one task per MPI process
19. Complexity Analysis
- Sequential algorithm complexity: Θ(n²)
- Parallel algorithm computational complexity: Θ(n²/p)
  - Each processor has n/p columns
  - Each column has n elements
- Communication complexity: Θ(p + n log(p))
  - The scatter takes log(p) steps, sending n/p elements to (p-1) processors
    - log(p)·latency + (n/p)(p-1)/bandwidth
  - All-to-all
    - Option 1: each process sends a message to the rest, sending each destination only the part it requires
      - Number of messages: (p-1)
      - Total amount sent: O(n)
      - Complexity: (p-1)·latency + n/bandwidth
- Total: Θ(p + n²/p)
20. Isoefficiency Analysis
- Sequential time complexity: Θ(n²)
- The only parallel overhead is the all-to-all
- When n is large, message transmission time dominates message latency
- Parallel communication time: Θ(n + p log(p))
- n² ≥ Cpn ⇒ n ≥ Cp
- Scalability function is the same as for the rowwise algorithm: C²p
21. Printing the Result Vector
- To view the vector in order of its indices, only one processor should print it
- This is the opposite of a scatter: a gather operation
[Figure: the blocks of the result vector collected onto a single process]
22. Function MPI_Gatherv
23. Header for MPI_Gatherv
int MPI_Gatherv (
    void         *send_buffer,
    int           send_cnt,
    MPI_Datatype  send_type,
    void         *receive_buffer,
    int          *receive_cnt,
    int          *receive_disp,
    MPI_Datatype  receive_type,
    int           root,
    MPI_Comm      communicator)
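A sketch of the printing step described above, assuming a block-decomposed result vector; the names (print_result, local_y, cnts, disps) are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Gather the distributed result onto one process and print it in index order. */
void print_result(double *local_y, int local_n,   /* this process's block            */
                  int *cnts, int *disps, int n,   /* per-process counts/offsets, total length */
                  int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    double *y = NULL;
    if (rank == root)
        y = malloc((size_t)n * sizeof(double));

    /* Collect every block at the root in index order. */
    MPI_Gatherv(local_y, local_n, MPI_DOUBLE,
                y, cnts, disps, MPI_DOUBLE, root, comm);

    if (rank == root) {
        for (int i = 0; i < n; i++)
            printf("%g\n", y[i]);
        free(y);
    }
}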
24. Count/Displacement Arrays
- Scatterv, Gatherv, and Alltoallv require two kinds of count/displacement arrays (a sketch of building them follows this list)
- First pair for the values being sent
  - send_cnt: number of elements (always)
  - send_disp: index of the first element (if breaking up an array)
- Second pair for the values being received
  - recv_cnt: number of elements (always)
  - recv_disp: index of the first element (if joining into an array)
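A minimal sketch of how such arrays might be built for a block decomposition of n elements over p processes; the helper name make_cnts_disps and the low/high formulas follow the usual block-distribution convention and are assumptions, not taken from the slides.

#include <stdlib.h>

/* Build count and displacement arrays for a block decomposition. */
void make_cnts_disps(int n, int p, int **cnts_out, int **disps_out)
{
    int *cnts  = malloc(p * sizeof(int));
    int *disps = malloc(p * sizeof(int));
    for (int k = 0; k < p; k++) {
        /* Process k owns elements [k*n/p, (k+1)*n/p). */
        int low  = (k * n) / p;
        int high = ((k + 1) * n) / p;
        cnts[k]  = high - low;   /* number of elements sent to / received from k */
        disps[k] = low;          /* index of k's first element in the full array */
    }
    *cnts_out  = cnts;
    *disps_out = disps;
}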
25. Method 3: Matrix Divided into Cross-Sections (Blocks), Vector Decomposed
26. Decomposing the Vector
27. Agglomeration and Mapping
- Static number of tasks
- Regular communication pattern (all-to-all)
- Computation time per task is constant
- Strategy
  - Agglomerate groups of columns
  - Create one task per MPI process
28. Complexity Analysis
- Assume p is a square number
- Each process does its share of the computation: Θ(n²/p)
  - Each block is an (n/√p) × (n/√p) matrix
- Redistributing b: Θ(log(√p)·(n/√p))
  - Sending n/√p elements
  - Can be done in log(√p) steps (sends to log(√p) processors)
- Reduction of the partial result vectors: Θ(log(√p)·(n/√p))
  - log(√p) communication steps
  - Have to add n/√p elements at each step
- Overall parallel complexity: Θ(n²/p + log(√p)·(n/√p))
29. Isoefficiency Analysis
- Sequential complexity: Θ(n²)
- Parallel communication complexity: Θ(log(√p)·(n/√p)) = Θ(n log p / √p)
- Isoefficiency function: n² ≥ C n √p log p ⇒ n ≥ C √p log p
- This system is much more scalable than the previous two implementations (see the sketch after this list)
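A short sketch of why this decomposition scales better, using the same memory function M(n) = n² as before; the C²p figure for the striped algorithms is the one quoted on slide 20.

% Scalability of the checkerboard (block) algorithm, assuming M(n) = n^2
\begin{align*}
n \ge C\sqrt{p}\,\log p
  \;\Longrightarrow\;
  \frac{M(C\sqrt{p}\,\log p)}{p}
  = \frac{C^2\,p\,\log^2 p}{p}
  = C^2 \log^2 p
\end{align*}
% C^2 log^2 p grows far more slowly with p than the C^2 p of the rowwise and
% columnwise algorithms, so memory per processor grows much more slowly and the
% block decomposition is the most scalable of the three.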