Title: CS 584
1. CS 584
2. Dense Matrix Algorithms
- There are two types of matrices:
  - Dense (full)
  - Sparse
- We will consider matrices that are:
  - Dense
  - Square
3. Mapping Matrices
- How do we partition a matrix for parallel processing?
- There are two basic ways:
  - Striped partitioning
  - Block partitioning
4. Striped Partitioning
[Figure: a matrix striped across processors P0-P3, comparing cyclic striping with block striping.]
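A minimal sketch (mine, not from the slides) of which processor owns row i under each striping scheme, assuming P divides n:

def block_stripe_owner(i, n, P):
    # contiguous band of n/P rows per processor
    return i // (n // P)

def cyclic_stripe_owner(i, P):
    # rows dealt out round-robin
    return i % P

n, P = 8, 4
print([block_stripe_owner(i, n, P) for i in range(n)])  # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_stripe_owner(i, P) for i in range(n)])    # [0, 1, 2, 3, 0, 1, 2, 3]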
5. Block Partitioning
[Figure: a matrix partitioned in two dimensions across processors P0-P7, comparing cyclic checkerboard with block checkerboard.]
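Correspondingly, a sketch (mine) of element ownership under block checkerboard partitioning on a sqrt(P) x sqrt(P) process grid:

import math

def block_checkerboard_owner(i, j, n, P):
    # assumes P is a perfect square and sqrt(P) divides n
    q = math.isqrt(P)   # process grid is q x q
    b = n // q          # each processor owns a b x b block
    return (i // b) * q + (j // b)

n, P = 4, 4
for i in range(n):
    print([block_checkerboard_owner(i, j, n, P) for j in range(n)])
# prints [0, 0, 1, 1] twice, then [2, 2, 3, 3] twice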
6. Block vs. Striped Partitioning
- Scalability?
  - Striping is limited to n processors
  - Checkerboarding is limited to n x n processors
- Complexity?
  - Striping is easy
  - Block partitioning can introduce more dependencies
7. Dense Matrix Algorithms
- Transposition
- Matrix-Vector Multiplication
- Matrix-Matrix Multiplication
- Solving Systems of Linear Equations
  - Gaussian Elimination
8. Matrix Transposition
- The transpose of A is A^T, such that A^T[i,j] = A[j,i]
- All elements below the diagonal move above the diagonal, and vice-versa
- If we assume unit time to exchange a pair of elements, the transpose takes (n^2 - n)/2 steps (see the sketch below)
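A serial sketch (mine) that makes the count explicit: each swap exchanges one below-diagonal element with its mirror above the diagonal, so there are exactly (n^2 - n)/2 swaps:

def transpose_inplace(A):
    n = len(A)
    swaps = 0
    for i in range(n):
        for j in range(i + 1, n):   # strictly below-diagonal pairs only
            A[i][j], A[j][i] = A[j][i], A[i][j]
            swaps += 1
    return swaps

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(transpose_inplace(A))  # 3 == (3*3 - 3) / 2
print(A)                     # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]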
9. Transpose
- Consider the case where each processor holds more than one element.
- Hypothesis: the transpose of the full matrix can be done by first sending the multi-element messages to their destinations and then transposing the contents of each message.
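A minimal serial simulation of this hypothesis (my own construction; the actual message passing is elided). Each key (p, q) stands for the block held by processor (p, q):

def transpose_blocked(blocks):
    # step 1: "send" block (p, q) to the owner of position (q, p)
    moved = {(q, p): blk for (p, q), blk in blocks.items()}
    # step 2: transpose the contents of each received message
    return {pos: [list(row) for row in zip(*blk)] for pos, blk in moved.items()}

blocks = {(0, 0): [[1, 2], [5, 6]],    (0, 1): [[3, 4], [7, 8]],
          (1, 0): [[9, 10], [13, 14]], (1, 1): [[11, 12], [15, 16]]}
print(transpose_blocked(blocks)[(0, 1)])  # [[9, 13], [10, 14]]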
10. Transpose (Striped Partitioning)
11. Transpose (Block Partitioning)
13. Matrix Multiplication
14. One-Dimensional Decomposition
- Each processor "owns" the black portion
- To compute its portion of the answer, each processor requires all of A (see the sketch below)
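A serial NumPy sketch (mine) of the one-dimensional, column-striped scheme: each task owns one column block of B and of C, and computing its block of C reads the entire matrix A:

import numpy as np

def matmul_1d(A, B, P):
    n = A.shape[0]
    w = n // P                      # width of each column stripe
    C = np.zeros((n, n))
    for p in range(P):              # one iteration per task
        Bp = B[:, p*w:(p+1)*w]      # the column block task p owns
        C[:, p*w:(p+1)*w] = A @ Bp  # needs all of A
    return C

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(matmul_1d(A, B, 2), A @ B)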
15. Two-Dimensional Decomposition
- Requires less data per processor
- The algorithm can be performed stepwise
16. Broadcast an A sub-matrix to the other processors in its row. Compute. Rotate the B sub-matrix upwards.
17Algorithm
Set B' Blocal for j 0 to sqrt(P) -2 in each
row I the (Ij) mod sqrt(P)th task
broadcasts A' Alocal to the other tasks in
the row accumulate A' B' send B' to upward
neighbor done
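A serial NumPy simulation of the algorithm above (my own rendering; q = sqrt(P) tasks per side, each holding one block of A, B, and C):

import numpy as np

def broadcast_rotate_matmul(A, B, q):
    n = A.shape[0]
    b = n // q
    # B' starts as each task's local block of B
    Bp = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
    C = np.zeros_like(A)
    for step in range(q):
        for i in range(q):
            k = (i + step) % q             # task (i, k) broadcasts its A block
            Ab = A[i*b:(i+1)*b, k*b:(k+1)*b]
            for j in range(q):             # every task in row i accumulates
                C[i*b:(i+1)*b, j*b:(j+1)*b] += Ab @ Bp[i][j]
        Bp = Bp[1:] + Bp[:1]               # send B' to the upward neighbor
    return C

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(broadcast_rotate_matmul(A, B, 2), A @ B)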
18. Cannon's Algorithm
- Broadcasting a sub-matrix to all who need it is costly.
- Suggestion: shift both sub-matrices
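A serial NumPy simulation of Cannon's algorithm (my own rendering): after an initial skew (row i of A shifted left by i, column j of B shifted up by j), each of the sqrt(P) steps multiplies local blocks, then shifts A left and B up by one:

import numpy as np

def cannon_matmul(A, B, q):
    n = A.shape[0]
    b = n // q
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Ab = [row[i:] + row[:i] for i, row in enumerate(Ab)]             # skew A
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]  # skew B
    C = np.zeros_like(A)
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                C[i*b:(i+1)*b, j*b:(j+1)*b] += Ab[i][j] @ Bb[i][j]
        Ab = [row[1:] + row[:1] for row in Ab]                           # shift A left
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift B up
    return C

A, B = np.random.rand(6, 6), np.random.rand(6, 6)
assert np.allclose(cannon_matmul(A, B, 3), A @ B)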
22. Divide and Conquer

[ App Apq ]   [ Bpp Bpq ]
[ Aqp Aqq ] x [ Bqp Bqq ]

Eight tasks, one block product each:
P0 = App x Bpp    P1 = Apq x Bqp    P2 = App x Bpq    P3 = Apq x Bqq
P4 = Aqp x Bpp    P5 = Aqq x Bqp    P6 = Aqp x Bpq    P7 = Aqq x Bqq

The result blocks are formed by pairwise sums: Cpp = P0 + P1, Cpq = P2 + P3, Cqp = P4 + P5, Cqq = P6 + P7.
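A recursive serial sketch of this scheme (mine; assumes n is a power of two). The eight recursive products correspond to the eight tasks above:

import numpy as np

def dc_matmul(A, B):
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    App, Apq, Aqp, Aqq = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    Bpp, Bpq, Bqp, Bqq = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    Cpp = dc_matmul(App, Bpp) + dc_matmul(Apq, Bqp)  # P0 + P1
    Cpq = dc_matmul(App, Bpq) + dc_matmul(Apq, Bqq)  # P2 + P3
    Cqp = dc_matmul(Aqp, Bpp) + dc_matmul(Aqq, Bqp)  # P4 + P5
    Cqq = dc_matmul(Aqp, Bpq) + dc_matmul(Aqq, Bqq)  # P6 + P7
    return np.block([[Cpp, Cpq], [Cqp, Cqq]])

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(dc_matmul(A, B), A @ B)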
23. Systems of Linear Equations
- A linear equation in n variables has the form

  a0x0 + a1x1 + ... + an-1xn-1 = b

- A set of linear equations is called a system.
- A vector is a solution of a system iff it satisfies every equation in the system.
- Many scientific and engineering problems take this form.
24. Solving Systems of Equations

a0,0x0   + a0,1x1   + ... + a0,n-1xn-1   = b0
a1,0x0   + a1,1x1   + ... + a1,n-1xn-1   = b1
...
an-1,0x0 + an-1,1x1 + ... + an-1,n-1xn-1 = bn-1

- Many such systems are large.
  - Thousands of equations and unknowns
25. Solving Systems of Equations
- A linear system of equations can be represented in matrix form:

  [ a0,0    a0,1    ...  a0,n-1   ] [ x0   ]   [ b0   ]
  [ a1,0    a1,1    ...  a1,n-1   ] [ x1   ] = [ b1   ]
  [ ...                           ] [ ...  ]   [ ...  ]
  [ an-1,0  an-1,1  ...  an-1,n-1 ] [ xn-1 ]   [ bn-1 ]

  Ax = b
26. Solving Systems of Equations
- Solving a system of linear equations is done in two steps:
  - Reduce the system to upper-triangular form
  - Use back-substitution to find the solution
- These steps are performed on the system in matrix form.
  - Gaussian Elimination, etc.
27. Solving Systems of Equations
- Reduce the system to upper-triangular form
- Use back-substitution (a sketch follows below)
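The slides do not spell back-substitution out, so here is a minimal NumPy sketch (mine) for an upper-triangular system Ux = y: solve the last equation first, then substitute upward:

import numpy as np

def back_substitute(U, y):
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                      # last row first
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

U = np.array([[2., 1., 1.], [0., 3., 2.], [0., 0., 4.]])
y = np.array([5., 8., 4.])
x = back_substitute(U, y)                               # x == [1, 2, 1]
assert np.allclose(U @ x, y)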
28. Reducing the System
- Gaussian elimination systematically eliminates variable xk from equations k+1 to n-1
  - Reduces those coefficients to zero
- This is done by subtracting an appropriate multiple of the kth equation from each of equations k+1 to n-1
29. Procedure GaussianElimination(A, b, y)

for k := 0 to n-1
    /* Division step */
    for j := k+1 to n-1
        A[k,j] := A[k,j] / A[k,k]
    y[k] := b[k] / A[k,k]
    A[k,k] := 1
    /* Elimination step */
    for i := k+1 to n-1
        for j := k+1 to n-1
            A[i,j] := A[i,j] - A[i,k] * A[k,j]
        b[i] := b[i] - A[i,k] * y[k]
        A[i,k] := 0
    endfor
endfor
end
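A direct, runnable Python rendering of the procedure above (serial, no pivoting; assumes no zero pivot is encountered):

import numpy as np

def gaussian_elimination(A, b):
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    y = np.zeros(n)
    for k in range(n):
        A[k, k+1:] /= A[k, k]        # division step: normalize row k
        y[k] = b[k] / A[k, k]
        A[k, k] = 1.0
        for i in range(k + 1, n):    # elimination step: zero column k below
            A[i, k+1:] -= A[i, k] * A[k, k+1:]
            b[i] -= A[i, k] * y[k]
            A[i, k] = 0.0
    return A, y                      # unit upper-triangular A and new rhs y

A = np.array([[2., 1., 1.], [4., 3., 3.], [8., 7., 9.]])
b = np.array([5., 10., 24.])
U, y = gaussian_elimination(A, b)
assert np.allclose(np.linalg.solve(U, y), np.linalg.solve(A, b))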
30. Parallelizing Gaussian Elimination
- Use domain decomposition
  - Rowwise striping
- The division step requires no communication
- The elimination step requires a one-to-all broadcast for each equation
- No agglomeration
  - Initially map one row to each processor
34. Communication Analysis
- Consider the algorithm step by step
- The division step requires no communication
- The elimination step requires a one-to-all broadcast
  - only broadcast to other active processors
  - only broadcast active elements
- The final computation requires no communication
35. Communication Analysis
- One-to-all broadcast
  - log2(q) communication steps
  - q = n - k - 1 active processors
- Message size
  - q active processors
  - q elements required

T = (ts + tw*q) * log2(q)
36. Computation Analysis
- Division step
  - q divisions
- Elimination step
  - q multiplications and q subtractions
- Assuming equal time per operation -> 3q operations
37. Computation Analysis
- In each step, the active processor set is reduced by one, resulting in:

CompTime = sum_{k=0}^{n-1} 3(n - k - 1) = 3n(n - 1)/2
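A one-line numeric check of the closed form (mine):

n = 10
assert sum(3 * (n - k - 1) for k in range(n)) == 3 * n * (n - 1) // 2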
38. Can we do better?
- The previous version is synchronous, and parallelism is reduced at each step.
- Pipeline the algorithm
- Run the resulting algorithm on a linear array of processors
  - Communication is nearest-neighbor
- Results in O(n) steps of O(n) operations
39. Pipelined Gaussian Elimination
- Basic assumption: a processor does not need to wait until all processors have received a value before proceeding.
- Algorithm:
  - If processor p has data for other processors, send the data to processor p+1
  - If processor p can do some computation using the data it has, do it
  - Otherwise, wait to receive data from processor p-1
42. Conclusion
- Using a striped partitioning method, it is natural to pipeline the Gaussian elimination algorithm to achieve the best performance.
- Pipelined algorithms work best on a linear array of processors
  - Or anything that can be linearly mapped
- Would it be better to block partition?
  - How would it affect the algorithm?