1
Dense Matrix Algorithms
  • CS 524 High-Performance Computing

2
Definitions
  • p = number of processors (0 to p-1)
  • n = dimension of array/matrix (0 to n-1)
  • q = number of blocks along one dimension (0 to q-1)
  • tc = computation time for one flop
  • ts = communication startup time
  • tw = communication transfer time per word
  • Interconnection network: crossbar switch with bi-directional links

3
Partitioning One Dimension (Striping)
4
Partitioning Both Dimensions (Checkerboard)
5
Matrix Transpose (MT)
  • AT(i, j) = A(j, i) for all i and j
  • Sequential run time:
  • do i = 0, n-1
  •   do j = 0, n-1
  •     B(i, j) = A(j, i)
  •   end do
  • end do
  • Run time is (n² - n)/2 ≈ n²/2 (the number of swaps when the transpose is done in place; see the sketch below)
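A minimal serial sketch of the in-place version (added for illustration, not from the slides; the size N and the printout are assumptions). It swaps each element above the diagonal with its mirror image, i.e. (n² - n)/2 swaps:

    #include <stdio.h>

    #define N 4   /* assumed matrix dimension */

    /* In-place transpose: swap each element above the diagonal with its
       mirror below the diagonal, (N*N - N)/2 swaps in total. */
    void transpose(double A[N][N]) {
        for (int i = 0; i < N; i++) {
            for (int j = i + 1; j < N; j++) {
                double tmp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = tmp;
            }
        }
    }

    int main(void) {
        double A[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = i * N + j;      /* fill with row-major indices */
        transpose(A);
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                printf("%6.1f ", A[i][j]);
            printf("\n");
        }
        return 0;
    }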

6
MT - Checkerboard Partitioning (1)
7
MT - Checkerboard Partitioning (2)
8
MT - Striped Partitioning
9
Matrix-Vector Multiplication (MVM)
  • MVM: y = Ax
  • do i = 0, n-1
  •   y(i) = 0
  •   do j = 0, n-1
  •     y(i) = y(i) + A(i, j)*x(j)
  •   end do
  • end do
  • Sequential algorithm requires n² multiplications and additions
  • Assuming one flop takes tc time, the sequential run time is 2tcn²

10
Row-wise Striping, p = n (1)
11
Row-wise Striping, p = n (2)
  • Data partitioning: Pi has row i of A and element i of x
  • Communication: each processor broadcasts its element of x
  • Computation: each processor performs n multiplications and additions
  • Parallel run time: Tp = 2ntc + p(ts + tw) = 2ntc + n(ts + tw)
  • Algorithm is cost-optimal as both parallel and serial costs are O(n²)

12
Row-wise Striping, p < n
  • Data partitioning: each processor has n/p rows of A and the corresponding n/p elements of x
  • Communication: each processor broadcasts its elements of x
  • Computation: each processor performs n²/p multiplications and additions
  • Parallel run time: Tp = 2tcn²/p + pts + (n/p)tw
  • Algorithm is cost-optimal for p = O(n) (see the MPI sketch below)
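A minimal MPI sketch of row-wise striped MVM for p < n (added for illustration, not from the slides; the dimension n, the placeholder data, and the use of MPI_Allgather to realize the "each processor broadcasts its elements of x" step are assumptions):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, me;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        int n = 1024;            /* assumed dimension, divisible by p */
        int rows = n / p;        /* rows of A (and elements of x) per processor */

        double *Aloc = malloc(rows * n * sizeof(double));   /* my n/p rows of A */
        double *xloc = malloc(rows * sizeof(double));       /* my n/p elements of x */
        double *x    = malloc(n * sizeof(double));          /* full x after the broadcast */
        double *yloc = calloc(rows, sizeof(double));        /* my n/p elements of y */
        for (int k = 0; k < rows * n; k++) Aloc[k] = 1.0;   /* placeholder data */
        for (int k = 0; k < rows; k++)     xloc[k] = 1.0;

        /* Communication: all-to-all broadcast of the local pieces of x */
        MPI_Allgather(xloc, rows, MPI_DOUBLE, x, rows, MPI_DOUBLE, MPI_COMM_WORLD);

        /* Computation: n^2/p multiply-adds on the local rows */
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < n; j++)
                yloc[i] += Aloc[i * n + j] * x[j];

        MPI_Finalize();
        return 0;
    }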

13
Checkerboard Partitioning, p = n² (1)
14
Checkerboard Partitioning, p = n² (2)
  • Data partitioning: each processor has one element of A; only processors in the last column have one element of x
  • Communication:
  • One element of x sent from the last-column processor to the diagonal processor of its row
  • Broadcast from the diagonal processor to all processors in its column
  • Global sum of y from all processors in a row to the last processor
  • Computation: one multiplication and one addition
  • Parallel run time: Tp = 2tc + 3(ts + tw)
  • Algorithm is cost-optimal as both serial and parallel costs are O(n²)
  • For a bus network, the communication time is 3n(ts + tw); the system is not cost-optimal as the cost is O(n³)

15
Checkerboard Partitioning, p < n²
  • Data partitioning: each processor has an (n/√p) x (n/√p) block of A; processors in the last column have n/√p elements of x
  • Communication:
  • n/√p elements of x sent from the last-column processor to the diagonal processor of its row
  • Broadcast from the diagonal processor to all processors in its column
  • Global sum of y from all processors in a row to the last processor
  • Computation: n²/p multiplications and additions
  • Parallel run time: Tp = 2tcn²/p + 3(ts + (n/√p)tw)
  • Algorithm is cost-optimal only if p = O(n²) (a sketch of these steps follows)
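A minimal MPI sketch of the four checkerboard MVM steps above (added for illustration, not from the slides; the row-major mapping of ranks onto a √p x √p grid, the dimension n, and the placeholder data are assumptions):

    #include <mpi.h>
    #include <stdlib.h>
    #include <math.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, me;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        int q  = (int)sqrt((double)p);    /* p assumed to be a perfect square */
        int n  = 512;                     /* assumed dimension, divisible by q */
        int nb = n / q;                   /* n/sqrt(p) */
        int row = me / q, col = me % q;   /* my grid coordinates */

        MPI_Comm row_comm, col_comm;      /* rank in row_comm = col, in col_comm = row */
        MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

        double *Ablk  = malloc(nb * nb * sizeof(double));  /* my block of A */
        double *xseg  = malloc(nb * sizeof(double));       /* n/sqrt(p) elements of x */
        double *ypart = calloc(nb, sizeof(double));        /* partial sums for my row block */
        double *yseg  = malloc(nb * sizeof(double));       /* final y segment (last column only) */
        for (int k = 0; k < nb * nb; k++) Ablk[k] = 1.0;   /* placeholder data */
        for (int k = 0; k < nb; k++)      xseg[k] = 1.0;   /* meaningful only in the last column */

        /* 1. last-column processor of row i sends its x segment to the diagonal processor (i, i) */
        if (col == q - 1 && row != q - 1)
            MPI_Send(xseg, nb, MPI_DOUBLE, row, 0, row_comm);
        if (col == row && row != q - 1)
            MPI_Recv(xseg, nb, MPI_DOUBLE, q - 1, 0, row_comm, MPI_STATUS_IGNORE);

        /* 2. diagonal processor (j, j) broadcasts the segment to all processors in column j */
        MPI_Bcast(xseg, nb, MPI_DOUBLE, col, col_comm);

        /* 3. local computation: nb*nb = n^2/p multiply-adds */
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++)
                ypart[i] += Ablk[i * nb + j] * xseg[j];

        /* 4. global sum of the partial results along each row onto the last-column processor */
        MPI_Reduce(ypart, yseg, nb, MPI_DOUBLE, MPI_SUM, q - 1, row_comm);

        MPI_Finalize();
        return 0;
    }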

16
Matrix-Matrix Multiplication (MMM)
  • C = A x B, where A, B, and C are n x n square matrices
  • Block matrix multiplication: algebraic operations on sub-matrices, or blocks, of the matrices. This view of MMM aids parallelization.
  • do i = 0, q-1
  •   do j = 0, q-1
  •     do k = 0, q-1
  •       Ci,j = Ci,j + Ai,k x Bk,j
  •     end do
  •   end do
  • end do
  • Number of multiplications and additions = n³. Sequential run time = 2tcn³ (a serial sketch of the blocked loop follows)
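A minimal serial sketch of the blocked triple loop (added for illustration, not from the slides; the sizes N and Q are assumptions):

    #include <stdlib.h>

    #define N  256          /* assumed matrix dimension */
    #define Q  4            /* assumed number of blocks per dimension */
    #define NB (N / Q)      /* block dimension */

    /* One block operation of the slide's inner statement: C(i,j) = C(i,j) + A(i,k) x B(k,j). */
    static void block_multiply(const double *A, const double *B, double *C,
                               int bi, int bj, int bk) {
        for (int i = 0; i < NB; i++)
            for (int k = 0; k < NB; k++)
                for (int j = 0; j < NB; j++)
                    C[(bi*NB + i)*N + (bj*NB + j)] +=
                        A[(bi*NB + i)*N + (bk*NB + k)] * B[(bk*NB + k)*N + (bj*NB + j)];
    }

    int main(void) {
        double *A = malloc(N * N * sizeof(double));
        double *B = malloc(N * N * sizeof(double));
        double *C = calloc(N * N, sizeof(double));
        for (int k = 0; k < N * N; k++) { A[k] = 1.0; B[k] = 1.0; }   /* placeholder data */

        /* the triple loop over blocks from the slide: q^3 block multiplications */
        for (int i = 0; i < Q; i++)
            for (int j = 0; j < Q; j++)
                for (int k = 0; k < Q; k++)
                    block_multiply(A, B, C, i, j, k);

        free(A); free(B); free(C);
        return 0;
    }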

17
Checkerboard Partitioning, q = √p
  • Data partitioning: Pi,j has blocks Ai,j and Bi,j of A and B, each of dimension (n/√p) x (n/√p)
  • Communication: each processor broadcasts its submatrix Ai,j to all processors in its row; each processor broadcasts its submatrix Bi,j to all processors in its column
  • Computation: each processor performs n x (n/√p) x (n/√p) = n³/p multiplications and additions
  • Parallel run time: Tp = 2tcn³/p + 2√p(ts + (n²/p)tw)
  • Algorithm is cost-optimal only if p = O(n²)

18
Cannon's Algorithm (1)
  • Memory-efficient version of the checkerboard-partitioned block MMM
  • At any time, each processor has only one block of A and one block of B
  • Blocks are cycled after each computation in such a way that after √p computations the multiplication is complete for Ci,j
  • Initial distribution of the matrices is the same as in checkerboard partitioning
  • Communication:
  • Initial alignment: block Ai,j is moved left by i steps (with wraparound); block Bi,j is moved up by j steps (with wraparound)
  • Subsequent √p - 1 steps: block Ai,j is moved left by one step; block Bi,j is moved up by one step (both with wraparound)
  • After √p computation and communication steps the multiplication is complete for Ci,j (see the sketch below)
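A minimal MPI sketch of Cannon's algorithm (added for illustration, not from the slides; the dimension n, the placeholder data, and the use of a periodic Cartesian communicator for the wraparound shifts are assumptions):

    #include <mpi.h>
    #include <stdlib.h>
    #include <math.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, me;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        int q  = (int)sqrt((double)p);   /* sqrt(p) x sqrt(p) grid; p assumed a perfect square */
        int n  = 512;                    /* assumed dimension, divisible by q */
        int nb = n / q;                  /* block dimension n/sqrt(p) */

        int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2];   /* torus: shifts wrap around */
        MPI_Comm grid;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
        MPI_Comm_rank(grid, &me);
        MPI_Cart_coords(grid, me, 2, coords);
        int i = coords[0], j = coords[1];

        double *A = malloc(nb*nb*sizeof(double));
        double *B = malloc(nb*nb*sizeof(double));
        double *C = calloc(nb*nb, sizeof(double));
        for (int k = 0; k < nb*nb; k++) { A[k] = 1.0; B[k] = 1.0; }   /* placeholder data */

        int left, right, up, down, src, dst;
        MPI_Cart_shift(grid, 1, -1, &right, &left);   /* neighbours along my row */
        MPI_Cart_shift(grid, 0, -1, &down, &up);      /* neighbours along my column */

        /* initial alignment: A(i,j) moves left by i steps, B(i,j) moves up by j steps */
        MPI_Cart_shift(grid, 1, -i, &src, &dst);
        MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -j, &src, &dst);
        MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, dst, 1, src, 1, grid, MPI_STATUS_IGNORE);

        /* sqrt(p) multiply-and-shift steps */
        for (int step = 0; step < q; step++) {
            for (int r = 0; r < nb; r++)             /* C = C + A x B on the resident blocks */
                for (int k = 0; k < nb; k++)
                    for (int c = 0; c < nb; c++)
                        C[r*nb + c] += A[r*nb + k] * B[k*nb + c];
            MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, left, 0, right, 0, grid, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, up, 1, down, 1, grid, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }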

19
Cannon's Algorithm (2)
20
Cannon's Algorithm (3)
21
Cannon's Algorithm (4)
  • Communication:
  • √p point-to-point communications of size n²/p along rows
  • √p point-to-point communications of size n²/p along columns
  • Computation: over the √p steps, each processor performs n³/p multiplications and additions
  • Parallel run time: Tp = 2tcn³/p + 2√p(ts + (n²/p)tw)
  • Algorithm is cost-optimal if p = O(n²)

22
Fox's Algorithm (1)
  • Another memory-efficient version of the checkerboard-partitioned block MMM
  • Initial distribution of the matrices is the same as in checkerboard partitioning
  • At any time, each processor has only one block of A and one block of B
  • Steps (repeated √p times; see the sketch below):
  • 1. Broadcast Ai,i to all processors in the row
  • 2. Multiply the received block of A with the resident block of B
  • 3. Send the block of B up one step (with wraparound)
  • 4. Select block Ai,(j+1) mod √p (where Ai,j is the block broadcast in the previous step), broadcast it to all processors in the row, and go to step 2
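A minimal MPI sketch of Fox's algorithm (added for illustration, not from the slides; the dimension n, the placeholder data, and the row-major mapping of ranks onto a √p x √p grid are assumptions). The broadcast root in step t of row i is the processor holding block A(i, (i+t) mod √p), which is exactly the selection rule in step 4 above:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, me;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        int q  = (int)sqrt((double)p);   /* p assumed to be a perfect square */
        int n  = 512;                    /* assumed dimension, divisible by q */
        int nb = n / q;
        int i = me / q, j = me % q;      /* my grid coordinates */

        MPI_Comm row_comm, col_comm;     /* rank in row_comm = j, rank in col_comm = i */
        MPI_Comm_split(MPI_COMM_WORLD, i, j, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, j, i, &col_comm);

        double *A = malloc(nb*nb*sizeof(double)), *Atmp = malloc(nb*nb*sizeof(double));
        double *B = malloc(nb*nb*sizeof(double)), *C = calloc(nb*nb, sizeof(double));
        for (int k = 0; k < nb*nb; k++) { A[k] = 1.0; B[k] = 1.0; }   /* placeholder data */

        int up = (i - 1 + q) % q, down = (i + 1) % q;   /* neighbours in my column */

        for (int step = 0; step < q; step++) {
            int root = (i + step) % q;   /* column owning the A block broadcast in this step */
            if (j == root) memcpy(Atmp, A, nb*nb*sizeof(double));
            MPI_Bcast(Atmp, nb*nb, MPI_DOUBLE, root, row_comm);   /* steps 1 and 4 */

            for (int r = 0; r < nb; r++)   /* step 2: received A block times resident B block */
                for (int k = 0; k < nb; k++)
                    for (int c = 0; c < nb; c++)
                        C[r*nb + c] += Atmp[r*nb + k] * B[k*nb + c];

            /* step 3: send the resident B block up one step (with wraparound) */
            MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, up, 0, down, 0, col_comm, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }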

23
Fox's Algorithm (2)
24
Fox's Algorithm (3)
  • Communication:
  • √p broadcasts of size n²/p along rows
  • √p point-to-point communications of size n²/p along columns
  • Computation: each processor performs n³/p multiplications and additions
  • Parallel run time: Tp = 2tcn³/p + 2√p(ts + (n²/p)tw)
  • Algorithm is cost-optimal if p = O(n²)

25
Solving a System of Linear Equations
  • System of linear equations: Ax = b
  • A is a dense n x n matrix of coefficients
  • b is an n x 1 vector of right-hand-side (RHS) values
  • x is an n x 1 vector of unknowns
  • Solving for x is usually done in two stages:
  • First, Ax = b is reduced to Ux = y, where U is a unit upper triangular matrix: U(i, j) = 0 if i > j, otherwise U(i, j) ≠ 0, and U(i, i) = 1 for 0 ≤ i < n. This stage is called Gaussian elimination.
  • Second, the unknowns are solved in reverse order starting from x(n-1). This stage is called back-substitution.

26
Gaussian Elimination (1)
  • do k = 0, n-1
  •   do j = k+1, n-1
  •     A(k, j) = A(k, j)/A(k, k)
  •   end do
  •   y(k) = b(k)/A(k, k)
  •   A(k, k) = 1
  •   do i = k+1, n-1
  •     do j = k+1, n-1
  •       A(i, j) = A(i, j) - A(i, k)*A(k, j)
  •     end do
  •     b(i) = b(i) - A(i, k)*y(k)
  •     A(i, k) = 0
  •   end do
  • end do

27
Gaussian Elimination (2)
  • Computations:
  • Approximately n²/2 divisions
  • Approximately n³/3 - n²/2 multiplications and subtractions
  • Approximate sequential run time: Ts ≈ 2tcn³/3 (a worked check follows)

28
Striped Partitioning, p = n (1)
  • Data partitioning: each processor has one row of matrix A
  • Communication during iteration k (outermost loop):
  • Broadcast of the active part of the kth row (size n - k - 1) to processors k+1 to n-1
  • Computation during iteration k (outermost loop):
  • n - k - 1 divisions at processor Pk
  • n - k - 1 multiplications and subtractions at each processor Pi (k < i < n)
  • Parallel run time: Tp = (3/2)n(n-1)tc + nts + 0.5n(n-1)tw
  • Algorithm is cost-optimal since serial and parallel costs are O(n³) (see the MPI sketch below)
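A minimal MPI sketch of the synchronous row-striped elimination for p = n (added for illustration, not from the slides; one row per process, no pivoting, and the diagonally dominant placeholder data are assumptions):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int n, me;
        MPI_Comm_size(MPI_COMM_WORLD, &n);   /* p = n: process me owns row me of A */
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        double *row = malloc(n * sizeof(double));        /* my row of A */
        double *piv = malloc((n + 1) * sizeof(double));  /* active part of the pivot row plus y(k) */
        double b = 1.0, y = 0.0;
        for (int j = 0; j < n; j++) row[j] = (me == j) ? 2.0 : 1.0;   /* placeholder data */

        for (int k = 0; k < n; k++) {
            if (me == k) {
                for (int j = k + 1; j < n; j++) row[j] /= row[k];   /* division step */
                y = b / row[k];
                row[k] = 1.0;
                for (int j = k; j < n; j++) piv[j - k] = row[j];    /* pack active part of row k */
                piv[n - k] = y;
            }
            /* broadcast the active part of row k (size n-k) plus y(k) from process k */
            MPI_Bcast(piv, n - k + 1, MPI_DOUBLE, k, MPI_COMM_WORLD);
            if (me > k) {
                double aik = row[k];                                /* elimination step on my row */
                for (int j = k + 1; j < n; j++) row[j] -= aik * piv[j - k];
                b -= aik * piv[n - k];
                row[k] = 0.0;
            }
        }
        /* process i now holds row i of the unit upper triangular U, and y(i) in variable y */
        MPI_Finalize();
        return 0;
    }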

29
Striped Partitioning, p = n (2)
30
Striped Partitioning, p = n (3)
31
Pipelined Version (Striped Partitioning)
  • In the non-pipelined (synchronous) version, the outer loop over k is executed strictly in order.
  • When Pk is performing the division step, all other processors are idle
  • When the elimination step is performed, only processors k+1 to n-1 are active; the rest are idle
  • In the pipelined version, the division step, communication, and elimination step are overlapped.
  • Each processor communicates if it has data to communicate, computes if it has computations to be done, or waits if neither can be done (see the sketch below).
  • Cost-optimal for linear array, mesh, and hypercube interconnection networks that have directly-connected processors.
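A minimal sketch of the pipelined idea for the row-striped case (added for illustration, not from the slides; same data layout and assumptions as the synchronous sketch above). Instead of a broadcast, each process receives a pivot row from its predecessor, forwards it immediately, and only then eliminates, so the division, communication, and elimination of different outer-loop iterations overlap along the chain of processes:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int n, me;
        MPI_Comm_size(MPI_COMM_WORLD, &n);   /* p = n: process me owns row me of A */
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        double *row = malloc(n * sizeof(double));
        double *piv = malloc((n + 1) * sizeof(double));
        double b = 1.0, y = 0.0;
        for (int j = 0; j < n; j++) row[j] = (me == j) ? 2.0 : 1.0;   /* placeholder data */

        /* Pivot rows travel down the chain k -> k+1 -> ... -> n-1; a process forwards a
           pivot row before eliminating with it, so successive iterations are pipelined. */
        for (int k = 0; k <= me; k++) {
            if (me == k) {
                for (int j = k + 1; j < n; j++) row[j] /= row[k];   /* division step */
                y = b / row[k];
                row[k] = 1.0;
                for (int j = k; j < n; j++) piv[j - k] = row[j];
                piv[n - k] = y;
                if (me + 1 < n)
                    MPI_Send(piv, n - k + 1, MPI_DOUBLE, me + 1, k, MPI_COMM_WORLD);
            } else {                          /* me > k */
                MPI_Recv(piv, n - k + 1, MPI_DOUBLE, me - 1, k, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (me + 1 < n)               /* forward first, then compute */
                    MPI_Send(piv, n - k + 1, MPI_DOUBLE, me + 1, k, MPI_COMM_WORLD);
                double aik = row[k];
                for (int j = k + 1; j < n; j++) row[j] -= aik * piv[j - k];   /* elimination */
                b -= aik * piv[n - k];
                row[k] = 0.0;
            }
        }
        MPI_Finalize();
        return 0;
    }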

32
Pipelined Version (2)
33
Pipelined Version (3)
34
Striped Partitioning, p < n (1)
35
Striped Partitioning, p < n (2)
36
Checkerboard Partitioning, p = n² (1)
37
Checkerboard Partitioning, p = n² (2)
  • Data partitioning: Pi,j has element A(i, j) of matrix A
  • Communication during iteration k (outermost loop):
  • Broadcast of A(k, k) to processors (k, k+1) to (k, n-1) in the kth row
  • Broadcast of the modified A(i, k) along the ith row, for k < i < n
  • Broadcast of the modified A(k, j) along the jth column, for k < j < n
  • Computation during iteration k (outermost loop):
  • One division at Pk,k
  • One multiplication and subtraction at processors Pi,j (k < i, j < n)
  • Parallel run time: Tp = (3/2)n(n-1)tc + nts + 0.5(n-1)tw
  • Algorithm is cost-optimal since serial and parallel costs are O(n³)

38
Back-Substitution
  • Solution of Ux = y, where U is a unit upper triangular matrix
  • do k = n-1, 0, -1
  •   x(k) = y(k)
  •   do i = k-1, 0, -1
  •     y(i) = y(i) - x(k)*U(i, k)
  •   end do
  • end do
  • Computation: approximately n²/2 multiplications and subtractions
  • Parallel algorithm is similar to that for the Gaussian elimination stage