Title: CS 584
1. CS 584
2. Dense Matrix Algorithms
- There are two types of matrices:
  - Dense (full)
  - Sparse
- We will consider matrices that are:
  - Dense
  - Square
3. Mapping Matrices
- How do we partition a matrix for parallel processing?
- There are two basic ways:
  - Striped partitioning
  - Block partitioning
4. Striped Partitioning
[Figure: a matrix striped across processors P0-P3, comparing cyclic striping with block striping.]
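A minimal sketch (mine, not from the slides) of which processor owns row i under each striping scheme, assuming P divides n:

def block_stripe_owner(i, n, P):
    # contiguous band of n/P rows per processor
    return i // (n // P)

def cyclic_stripe_owner(i, P):
    # rows dealt out round-robin
    return i % P

n, P = 8, 4
print([block_stripe_owner(i, n, P) for i in range(n)])  # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_stripe_owner(i, P) for i in range(n)])    # [0, 1, 2, 3, 0, 1, 2, 3]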
5. Block Partitioning
[Figure: a matrix partitioned in two dimensions across processors P0-P7, comparing cyclic checkerboard with block checkerboard.]
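Correspondingly, a sketch (mine) of element ownership under block checkerboard partitioning on a sqrt(P) x sqrt(P) process grid:

import math

def block_checkerboard_owner(i, j, n, P):
    # assumes P is a perfect square and sqrt(P) divides n
    q = math.isqrt(P)   # process grid is q x q
    b = n // q          # each processor owns a b x b block
    return (i // b) * q + (j // b)

n, P = 4, 4
for i in range(n):
    print([block_checkerboard_owner(i, j, n, P) for j in range(n)])
# prints [0, 0, 1, 1] twice, then [2, 2, 3, 3] twice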
6. Block vs. Striped Partitioning
- Scalability?
  - Striping is limited to n processors
  - Checkerboarding is limited to n x n processors
- Complexity?
  - Striping is easy
  - Block partitioning can introduce more dependencies
7. Dense Matrix Algorithms
- Transposition
- Matrix-Vector Multiplication
- Matrix-Matrix Multiplication
- Solving Systems of Linear Equations
  - Gaussian Elimination
8. Matrix Transposition
- The transpose of A is A^T, such that A^T[i,j] = A[j,i]
- All elements below the diagonal move above the diagonal, and vice-versa
- If we assume unit time to exchange a pair of elements, the transpose takes (n^2 - n)/2 steps (see the sketch below)
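A serial sketch (mine) that makes the count explicit: each swap exchanges one below-diagonal element with its mirror above the diagonal, so there are exactly (n^2 - n)/2 swaps:

def transpose_inplace(A):
    n = len(A)
    swaps = 0
    for i in range(n):
        for j in range(i + 1, n):   # strictly below-diagonal pairs only
            A[i][j], A[j][i] = A[j][i], A[i][j]
            swaps += 1
    return swaps

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(transpose_inplace(A))  # 3 == (3*3 - 3) / 2
print(A)                     # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]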
9. Transpose
- Consider the case where each processor holds more than one element.
- Hypothesis: the transpose of the full matrix can be done by first sending the multi-element messages to their destinations and then transposing the contents of each message.
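A minimal serial simulation of this hypothesis (my own construction; the actual message passing is elided). Each key (p, q) stands for the block held by processor (p, q):

def transpose_blocked(blocks):
    # step 1: "send" block (p, q) to the owner of position (q, p)
    moved = {(q, p): blk for (p, q), blk in blocks.items()}
    # step 2: transpose the contents of each received message
    return {pos: [list(row) for row in zip(*blk)] for pos, blk in moved.items()}

blocks = {(0, 0): [[1, 2], [5, 6]],    (0, 1): [[3, 4], [7, 8]],
          (1, 0): [[9, 10], [13, 14]], (1, 1): [[11, 12], [15, 16]]}
print(transpose_blocked(blocks)[(0, 1)])  # [[9, 13], [10, 14]]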
10. Transpose (Striped Partitioning)
11. Transpose (Block Partitioning)
13. Matrix Multiplication
14. One-Dimensional Decomposition
- Each processor "owns" the black portion
- To compute its portion of the answer, each processor requires all of A (see the sketch below)
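A serial NumPy sketch (mine) of the one-dimensional, column-striped scheme: each task owns one column block of B and of C, and computing its block of C reads the entire matrix A:

import numpy as np

def matmul_1d(A, B, P):
    n = A.shape[0]
    w = n // P                      # width of each column stripe
    C = np.zeros((n, n))
    for p in range(P):              # one iteration per task
        Bp = B[:, p*w:(p+1)*w]      # the column block task p owns
        C[:, p*w:(p+1)*w] = A @ Bp  # needs all of A
    return C

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(matmul_1d(A, B, 2), A @ B)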
15. Two-Dimensional Decomposition
- Requires less data per processor
- The algorithm can be performed stepwise
16. Broadcast an A sub-matrix to the other processors in its row. Compute. Rotate the B sub-matrix upwards.
17Algorithm
Set B' Blocal for j 0 to sqrt(P) -2 in each
row I the (Ij) mod sqrt(P)th task
broadcasts A' Alocal to the other tasks in
the row accumulate A' B' send B' to upward
neighbor done
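A serial NumPy simulation of the algorithm above (my own rendering; q = sqrt(P) tasks per side, each holding one block of A, B, and C):

import numpy as np

def broadcast_rotate_matmul(A, B, q):
    n = A.shape[0]
    b = n // q
    # B' starts as each task's local block of B
    Bp = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(q)] for i in range(q)]
    C = np.zeros_like(A)
    for step in range(q):
        for i in range(q):
            k = (i + step) % q             # task (i, k) broadcasts its A block
            Ab = A[i*b:(i+1)*b, k*b:(k+1)*b]
            for j in range(q):             # every task in row i accumulates
                C[i*b:(i+1)*b, j*b:(j+1)*b] += Ab @ Bp[i][j]
        Bp = Bp[1:] + Bp[:1]               # send B' to the upward neighbor
    return C

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(broadcast_rotate_matmul(A, B, 2), A @ B)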
18. Cannon's Algorithm
- Broadcasting a sub-matrix to all who need it is costly.
- Suggestion: shift both sub-matrices
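A serial NumPy simulation of Cannon's algorithm (my own rendering): after an initial skew (row i of A shifted left by i, column j of B shifted up by j), each of the sqrt(P) steps multiplies local blocks, then shifts A left and B up by one:

import numpy as np

def cannon_matmul(A, B, q):
    n = A.shape[0]
    b = n // q
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Ab = [row[i:] + row[:i] for i, row in enumerate(Ab)]             # skew A
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]  # skew B
    C = np.zeros_like(A)
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                C[i*b:(i+1)*b, j*b:(j+1)*b] += Ab[i][j] @ Bb[i][j]
        Ab = [row[1:] + row[:1] for row in Ab]                           # shift A left
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift B up
    return C

A, B = np.random.rand(6, 6), np.random.rand(6, 6)
assert np.allclose(cannon_matmul(A, B, 3), A @ B)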
22. Divide and Conquer

[ App Apq ]   [ Bpp Bpq ]
[ Aqp Aqq ] x [ Bqp Bqq ]

Eight tasks, one block product each:
P0 = App x Bpp    P1 = Apq x Bqp    P2 = App x Bpq    P3 = Apq x Bqq
P4 = Aqp x Bpp    P5 = Aqq x Bqp    P6 = Aqp x Bpq    P7 = Aqq x Bqq

The result blocks are formed by pairwise sums: Cpp = P0 + P1, Cpq = P2 + P3, Cqp = P4 + P5, Cqq = P6 + P7.
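A recursive serial sketch of this scheme (mine; assumes n is a power of two). The eight recursive products correspond to the eight tasks above:

import numpy as np

def dc_matmul(A, B):
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    App, Apq, Aqp, Aqq = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    Bpp, Bpq, Bqp, Bqq = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    Cpp = dc_matmul(App, Bpp) + dc_matmul(Apq, Bqp)  # P0 + P1
    Cpq = dc_matmul(App, Bpq) + dc_matmul(Apq, Bqq)  # P2 + P3
    Cqp = dc_matmul(Aqp, Bpp) + dc_matmul(Aqq, Bqp)  # P4 + P5
    Cqq = dc_matmul(Aqp, Bpq) + dc_matmul(Aqq, Bqq)  # P6 + P7
    return np.block([[Cpp, Cpq], [Cqp, Cqq]])

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(dc_matmul(A, B), A @ B)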
23. Systems of Linear Equations
- A linear equation in n variables has the form

  a0x0 + a1x1 + ... + an-1xn-1 = b

- A set of linear equations is called a system.
- A vector is a solution of a system iff it satisfies every equation in the system.
- Many scientific and engineering problems take this form.
24. Solving Systems of Equations

a0,0x0   + a0,1x1   + ... + a0,n-1xn-1   = b0
a1,0x0   + a1,1x1   + ... + a1,n-1xn-1   = b1
...
an-1,0x0 + an-1,1x1 + ... + an-1,n-1xn-1 = bn-1

- Many such systems are large.
  - Thousands of equations and unknowns
25. Solving Systems of Equations
- A linear system of equations can be represented in matrix form:

  [ a0,0    a0,1    ...  a0,n-1   ] [ x0   ]   [ b0   ]
  [ a1,0    a1,1    ...  a1,n-1   ] [ x1   ] = [ b1   ]
  [ ...                           ] [ ...  ]   [ ...  ]
  [ an-1,0  an-1,1  ...  an-1,n-1 ] [ xn-1 ]   [ bn-1 ]

  Ax = b
26. Solving Systems of Equations
- Solving a system of linear equations is done in two steps:
  - Reduce the system to upper-triangular form
  - Use back-substitution to find the solution
- These steps are performed on the system in matrix form.
  - Gaussian Elimination, etc.
27. Solving Systems of Equations
- Reduce the system to upper-triangular form
- Use back-substitution (a sketch follows below)
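The slides do not spell back-substitution out, so here is a minimal NumPy sketch (mine) for an upper-triangular system Ux = y: solve the last equation first, then substitute upward:

import numpy as np

def back_substitute(U, y):
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                      # last row first
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

U = np.array([[2., 1., 1.], [0., 3., 2.], [0., 0., 4.]])
y = np.array([5., 8., 4.])
x = back_substitute(U, y)                               # x == [1, 2, 1]
assert np.allclose(U @ x, y)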
28. Reducing the System
- Gaussian elimination systematically eliminates variable xk from equations k+1 to n-1
  - Reduces those coefficients to zero
- This is done by subtracting an appropriate multiple of the kth equation from each of equations k+1 to n-1
29. Procedure GaussianElimination(A, b, y)

for k := 0 to n-1
    /* Division step */
    for j := k+1 to n-1
        A[k,j] := A[k,j] / A[k,k]
    y[k] := b[k] / A[k,k]
    A[k,k] := 1
    /* Elimination step */
    for i := k+1 to n-1
        for j := k+1 to n-1
            A[i,j] := A[i,j] - A[i,k] * A[k,j]
        b[i] := b[i] - A[i,k] * y[k]
        A[i,k] := 0
    endfor
endfor
end
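A direct, runnable Python rendering of the procedure above (serial, no pivoting; assumes no zero pivot is encountered):

import numpy as np

def gaussian_elimination(A, b):
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    y = np.zeros(n)
    for k in range(n):
        A[k, k+1:] /= A[k, k]        # division step: normalize row k
        y[k] = b[k] / A[k, k]
        A[k, k] = 1.0
        for i in range(k + 1, n):    # elimination step: zero column k below
            A[i, k+1:] -= A[i, k] * A[k, k+1:]
            b[i] -= A[i, k] * y[k]
            A[i, k] = 0.0
    return A, y                      # unit upper-triangular A and new rhs y

A = np.array([[2., 1., 1.], [4., 3., 3.], [8., 7., 9.]])
b = np.array([5., 10., 24.])
U, y = gaussian_elimination(A, b)
assert np.allclose(np.linalg.solve(U, y), np.linalg.solve(A, b))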
30. Parallelizing Gaussian Elimination
- Use domain decomposition
  - Rowwise striping
- The division step requires no communication
- The elimination step requires a one-to-all broadcast for each equation
- No agglomeration
  - Initially map one row to each processor
34. Communication Analysis
- Consider the algorithm step by step
- The division step requires no communication
- The elimination step requires a one-to-all broadcast
  - only broadcast to other active processors
  - only broadcast active elements
- The final computation requires no communication
35. Communication Analysis
- One-to-all broadcast
  - log2(q) communication steps
  - q = n - k - 1 active processors
- Message size
  - q active processors
  - q elements required

T = (ts + tw*q) * log2(q)
36. Computation Analysis
- Division step
  - q divisions
- Elimination step
  - q multiplications and q subtractions
- Assuming equal time per operation -> 3q operations
37. Computation Analysis
- In each step, the active processor set is reduced by one, resulting in:

CompTime = sum_{k=0}^{n-1} 3(n - k - 1) = 3n(n - 1)/2
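A one-line numeric check of the closed form (mine):

n = 10
assert sum(3 * (n - k - 1) for k in range(n)) == 3 * n * (n - 1) // 2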
38. Can we do better?
- The previous version is synchronous, and parallelism is reduced at each step.
- Pipeline the algorithm
- Run the resulting algorithm on a linear array of processors
  - Communication is nearest-neighbor
- Results in O(n) steps of O(n) operations
39. Pipelined Gaussian Elimination
- Basic assumption: a processor does not need to wait until all processors have received a value before proceeding.
- Algorithm:
  - If processor p has data for other processors, send the data to processor p+1
  - If processor p can do some computation using the data it has, do it
  - Otherwise, wait to receive data from processor p-1
42. Conclusion
- Using a striped partitioning method, it is natural to pipeline the Gaussian elimination algorithm to achieve the best performance.
- Pipelined algorithms work best on a linear array of processors
  - Or anything that can be linearly mapped
- Would it be better to block partition?
  - How would it affect the algorithm?