1
CS 584
2
Dense Matrix Algorithms
  • There are two types of matrices:
    • Dense (full)
    • Sparse
  • We will consider matrices that are:
    • Dense
    • Square

3
Mapping Matrices
  • How do we partition a matrix for parallel
    processing?
  • There are two basic ways:
    • Striped partitioning
    • Block partitioning

4
Striped Partitioning
[Figure: two row-striped layouts of a matrix over processors P0-P3.
Block striping assigns each processor a contiguous band of rows;
cyclic striping deals rows out to processors in round-robin order.]
5
Block Partitioning
[Figure: two checkerboard layouts of a matrix over processors P0-P7.
Block checkerboard assigns each processor one contiguous submatrix;
cyclic checkerboard deals elements out to the processor grid in
round-robin order along both dimensions.]
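To make the four layouts concrete, here is a minimal sketch in Python (the function names are ours and purely illustrative; it assumes n is divisible by the processor count and that grid tasks are numbered row-major) of which task owns a given row or element:

def owner_striped(i, n, P, cyclic=False):
    # Block striping: contiguous bands of n/P rows per task.
    # Cyclic striping: rows dealt out round-robin.
    return i % P if cyclic else i // (n // P)

def owner_checkerboard(i, j, n, q, cyclic=False):
    # q x q task grid. Block form: one (n/q) x (n/q) submatrix per task.
    # Cyclic form: elements dealt round-robin along both dimensions.
    if cyclic:
        return (i % q) * q + (j % q)
    b = n // q
    return (i // b) * q + (j // b)

For example, with n = 8 and P = 4, row 5 belongs to P2 under block striping (5 // 2) but to P1 under cyclic striping (5 % 4).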
6
Block vs. Striped Partitioning
  • Scalability?
    • Striping is limited to n processors
    • Checkerboard is limited to n x n processors
  • Complexity?
    • Striping is easy
    • Block partitioning can introduce more dependencies

7
Dense Matrix Algorithms
  • Transposition
  • Matrix-Vector Multiplication
  • Matrix-Matrix Multiplication
  • Solving Systems of Linear Equations
  • Gaussian Elimination

8
Matrix Transposition
  • The transpose of A is A^T such that A^T[i,j] = A[j,i]
  • All elements below the diagonal move above the
    diagonal and vice-versa
  • If we assume unit time to exchange a pair of
    elements, the transpose takes (n^2 - n)/2 steps
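As a serial sketch of that counting argument (illustrative Python, not from the slides):

import numpy as np

def transpose_in_place(A):
    # Swap each below-diagonal element with its mirror above the
    # diagonal: one unit-time exchange per pair, (n^2 - n)/2 in all.
    n = A.shape[0]
    swaps = 0
    for i in range(n):
        for j in range(i):
            A[i, j], A[j, i] = A[j, i], A[i, j]
            swaps += 1
    return swaps  # equals (n*n - n) // 2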

9
Transpose
  • Consider the case where each processor has more
    than one element.
  • Hypothesis
  • The transpose of the full matrix can be done by
    first sending the multiple element messages to
    their destination and then transposing the
    contents of the message.
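A serial sketch of that hypothesis (illustrative Python; assumes an n x n matrix split into q x q blocks, with q dividing n):

import numpy as np

def block_transpose(A, q):
    # Step 1 (the "message"): block (i, j) is delivered to grid
    # position (j, i).
    # Step 2 (local): the receiver transposes the block's contents.
    n = A.shape[0]
    b = n // q
    T = np.empty_like(A)
    for i in range(q):
        for j in range(q):
            T[j*b:(j+1)*b, i*b:(i+1)*b] = A[i*b:(i+1)*b, j*b:(j+1)*b].T
    return T

np.array_equal(block_transpose(A, 2), A.T) holds for any 4 x 4 A.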

10
Transpose (Striped Partitioning)
11
Transpose (Block Partitioning)
13
Matrix Multiplication
14
One Dimensional Decomposition
  • Each processor "owns" one portion of the matrices
    (the black region in the original figure)
  • To compute the owned portion of the answer, each
    processor requires all of A (see the sketch below)
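A serial sketch of that data requirement (illustrative Python; it assumes B and C are column-striped over P tasks, with P dividing n):

import numpy as np

def striped_matmul(A, B, P):
    # 1D (column-striped) decomposition of C = A B: task p owns
    # stripe p of B and C, yet its local product reads all of A.
    n = A.shape[0]
    w = n // P  # stripe width
    C = np.empty((n, n))
    for p in range(P):  # each iteration = one task's local work
        cols = slice(p * w, (p + 1) * w)
        C[:, cols] = A @ B[:, cols]  # uses every element of A
    return C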

15
Two Dimensional Decomposition
  • Requires less data per processor
  • Algorithm can be performed stepwise.

16
Broadcast an A submatrix to the other processors
in the row. Compute. Rotate the B submatrix
upwards.
17
Algorithm
set B' = B_local
for j = 0 to sqrt(P) - 2
    in each row i, the ((i + j) mod sqrt(P))th task
        broadcasts A' = A_local to the other tasks in the row
    accumulate A' * B'
    send B' to upward neighbor
done
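A serial simulation of this broadcast-compute-roll scheme (illustrative Python; assumes a q x q task grid with q = sqrt(P) dividing n, and runs all q steps so the product is complete):

import numpy as np

def broadcast_multiply_roll(A, B, q):
    n = A.shape[0]
    s = n // q  # block size
    Ablk = [[A[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(q)] for i in range(q)]
    Bblk = [[B[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(q)] for i in range(q)]
    Cblk = [[np.zeros((s, s)) for _ in range(q)] for _ in range(q)]
    for step in range(q):
        for i in range(q):
            # task (i, (i + step) mod q) broadcasts its A block along row i
            Abcast = Ablk[i][(i + step) % q]
            for j in range(q):
                Cblk[i][j] += Abcast @ Bblk[i][j]  # accumulate A' * B'
        # send B' to the upward neighbor: roll B blocks up one grid row
        Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cblk)

broadcast_multiply_roll(A, B, 2) matches A @ B for any 4 x 4 inputs.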
18
Cannon's Algorithm
  • Broadcasting a submatrix to all who need it is
    costly.
  • Suggestion: shift both submatrices instead.
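A serial simulation of Cannon's algorithm (illustrative Python; same block-grid assumptions as above): skew A and B once so every task starts with a matching pair of blocks, then multiply-accumulate and shift both matrices by one at each of the q steps.

import numpy as np

def cannon_matmul(A, B, q):
    n = A.shape[0]
    s = n // q  # block size
    Ablk = [[A[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(q)] for i in range(q)]
    Bblk = [[B[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(q)] for i in range(q)]
    # initial alignment: shift block row i of A left by i,
    # block column j of B up by j
    Ablk = [[Ablk[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bblk = [[Bblk[(i + j) % q][j] for j in range(q)] for i in range(q)]
    Cblk = [[np.zeros((s, s)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                Cblk[i][j] += Ablk[i][j] @ Bblk[i][j]
        # nearest-neighbor shifts replace the costly broadcast
        Ablk = [[Ablk[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cblk)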

22
Divide and Conquer
    | App  Apq |     | Bpp  Bpq |
    | Aqp  Aqq |  x  | Bqp  Bqq |

Eight submatrix products, one per processor
(each pair sums to one block of the result):
P0 = App Bpp   P1 = Apq Bqp
P2 = App Bpq   P3 = Apq Bqq
P4 = Aqp Bpp   P5 = Aqq Bqp
P6 = Aqp Bpq   P7 = Aqq Bqq
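A recursive serial sketch of this decomposition (illustrative Python; assumes square matrices whose size is a power of two). The eight sub-products map onto tasks P0-P7 above; in the parallel version each runs independently before the pairwise sums.

import numpy as np

def dc_matmul(A, B):
    # Recursive 2 x 2 block multiply (divide and conquer).
    n = A.shape[0]
    if n <= 2:
        return A @ B
    h = n // 2
    App, Apq, Aqp, Aqq = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    Bpp, Bpq, Bqp, Bqq = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    Cpp = dc_matmul(App, Bpp) + dc_matmul(Apq, Bqp)  # P0 + P1
    Cpq = dc_matmul(App, Bpq) + dc_matmul(Apq, Bqq)  # P2 + P3
    Cqp = dc_matmul(Aqp, Bpp) + dc_matmul(Aqq, Bqp)  # P4 + P5
    Cqq = dc_matmul(Aqp, Bpq) + dc_matmul(Aqq, Bqq)  # P6 + P7
    return np.block([[Cpp, Cpq], [Cqp, Cqq]])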
23
Systems of Linear Equations
  • A linear equation in n variables has the form
  • A set of linear equations is called a system.
  • A solution exists for a system iff the solution
    satisfies all equations in the system.
  • Many scientific and engineering problems take
    this form.

a0x0 + a1x1 + ... + an-1xn-1 = b
24
Solving Systems of Equations
a0,0x0 + a0,1x1 + ... + a0,n-1xn-1 = b0
a1,0x0 + a1,1x1 + ... + a1,n-1xn-1 = b1
    ...
an-1,0x0 + an-1,1x1 + ... + an-1,n-1xn-1 = bn-1
  • Many such systems are large.
  • Thousands of equations and unknowns

25
Solving Systems of Equations
  • A linear system of equations can be represented
    in matrix form

| a0,0    a0,1    ...  a0,n-1   | | x0   |   | b0   |
| a1,0    a1,1    ...  a1,n-1   | | x1   | = | b1   |
|  ...     ...          ...     | | ...  |   | ...  |
| an-1,0  an-1,1  ...  an-1,n-1 | | xn-1 |   | bn-1 |

i.e.  Ax = b
26
Solving Systems of Equations
  • Solving a system of linear equations is done in
    two steps:
    • Reduce the system to upper-triangular form
    • Use back-substitution to find the solution
  • These steps are performed on the system in matrix
    form.
    • Gaussian elimination, etc.

27
Solving Systems of Equations
  • Reduce the system to upper-triangular form
  • Use back-substitution

28
Reducing the System
  • Gaussian elimination systematically eliminates
    variable xk from equations k+1 to n-1.
  • Reduces the coefficients to zero
  • This is done by subtracting an appropriate
    multiple of the kth equation from each of the
    equations k+1 to n-1

29
Procedure GaussianElimination(A, b, y)
  for k = 0 to n-1
    /* Division step */
    for j = k+1 to n-1
      A[k,j] = A[k,j] / A[k,k]
    y[k] = b[k] / A[k,k]
    A[k,k] = 1
    /* Elimination step */
    for i = k+1 to n-1
      for j = k+1 to n-1
        A[i,j] = A[i,j] - A[i,k] * A[k,j]
      b[i] = b[i] - A[i,k] * y[k]
      A[i,k] = 0
  endfor
end
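A runnable version of the same procedure, with the back-substitution step from slide 26 appended (illustrative Python; no pivoting, so it assumes A[k,k] is nonzero at every step):

import numpy as np

def gaussian_eliminate(A, b):
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    n = len(b)
    y = np.empty(n)
    for k in range(n):
        # division step: normalize row k by the pivot
        A[k, k+1:] /= A[k, k]
        y[k] = b[k] / A[k, k]
        A[k, k] = 1.0
        # elimination step: zero out column k below the diagonal
        for i in range(k+1, n):
            A[i, k+1:] -= A[i, k] * A[k, k+1:]
            b[i] -= A[i, k] * y[k]
            A[i, k] = 0.0
    # back-substitution on the unit upper-triangular system A x = y
    x = np.empty(n)
    for k in range(n-1, -1, -1):
        x[k] = y[k] - A[k, k+1:] @ x[k+1:]
    return x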
30
Parallelizing Gaussian Elim.
  • Use domain decomposition
    • Rowwise striping
  • Division step requires no communication
  • Elimination step requires a one-to-all broadcast
    for each equation.
  • No agglomeration
  • Initially map one row to each processor

34
Communication Analysis
  • Consider the algorithm step by step
  • Division step requires no communication
  • Elimination step requires a one-to-all broadcast
    • only broadcast to other active processors
    • only broadcast active elements
  • Final computation requires no communication.

35
Communication Analysis
  • One-to-all broadcast
    • log2(q) communications
    • q = n - k - 1 active processors
  • Message size
    • q active processors
    • q elements required

T = (ts + tw * q) log2(q)
36
Computation Analysis
  • Division step
    • q divisions
  • Elimination step
    • q multiplications and q subtractions
  • Assuming all operations take equal time --> 3q
    operations per step

37
Computation Analysis
  • In each step, the active processor set is reduced
    by one, resulting in:

CompTime = sum_{k=0}^{n-1} 3(n - k - 1)
         = 3n(n - 1)/2
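A one-line numeric check of that closed form (illustrative Python):

n = 10
assert sum(3 * (n - k - 1) for k in range(n)) == 3 * n * (n - 1) // 2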
38
Can we do better?
  • The previous version is synchronous, and
    parallelism is reduced at each step.
  • Pipeline the algorithm
  • Run the resulting algorithm on a linear array of
    processors.
  • Communication is nearest-neighbor
  • Results in O(n) steps of O(n) operations

39
Pipelined Gaussian Elim.
  • Basic assumption: a processor does not need to
    wait until all processors have received a value
    before it proceeds.
  • Algorithm:
    • If processor p has data for other processors,
      send the data to processor p+1
    • If processor p can do some computation using the
      data it has, do it.
    • Otherwise, wait to receive data from processor p-1

42
Conclusion
  • Using a striped partitioning method, it is
    natural to pipeline the Gaussian elimination
    algorithm to achieve the best performance.
  • Pipelined algorithms work best on a linear array
    of processors.
    • Or on anything that can be linearly mapped
  • Would it be better to block partition?
  • How would that affect the algorithm?