1
Dense Matrix Algorithms
  • CS 524 High-Performance Computing

2
Definitions
  • p = number of processors (0 to p-1)
  • n = dimension of array/matrix (0 to n-1)
  • q = number of blocks along one dimension (0 to q-1)
  • tc = computation time for one flop
  • ts = communication startup time
  • tw = communication transfer time per word
  • Interconnection network: crossbar switch with bi-directional links

3
Partitioning One Dimension (Striping)
4
Partitioning Both Dimensions (Checkerboard)
5
Matrix Transpose (MT)
  • AT(i, j) = A(j, i) for all i and j
  • Sequential run time:
  • do i = 0, n-1
  •   do j = 0, n-1
  •     B(i, j) = A(j, i)
  •   end do
  • end do
  • Run time is (n² - n)/2 ≈ n²/2 (the number of swaps when the transpose is done in place; see the sketch below)
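A minimal serial sketch of the in-place version (added for illustration, not from the slides; the size N and the printout are assumptions). It swaps each element above the diagonal with its mirror image, i.e. (n² - n)/2 swaps:

    #include <stdio.h>

    #define N 4   /* assumed matrix dimension */

    /* In-place transpose: swap each element above the diagonal with its
       mirror below the diagonal, (N*N - N)/2 swaps in total. */
    void transpose(double A[N][N]) {
        for (int i = 0; i < N; i++) {
            for (int j = i + 1; j < N; j++) {
                double tmp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = tmp;
            }
        }
    }

    int main(void) {
        double A[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = i * N + j;      /* fill with row-major indices */
        transpose(A);
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                printf("%6.1f ", A[i][j]);
            printf("\n");
        }
        return 0;
    }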

6
MT - Checkerboard Partitioning (1)
7
MT - Checkerboard Partitioning (2)
8
MT - Striped Partitioning
9
Matrix-Vector Multiplication (MVM)
  • MVM: y = Ax
  • do i = 0, n-1
  •   y(i) = 0
  •   do j = 0, n-1
  •     y(i) = y(i) + A(i, j)*x(j)
  •   end do
  • end do
  • Sequential algorithm requires n² multiplications and additions
  • Assuming one flop takes tc time, the sequential run time is 2tcn²

10
Row-wise Striping, p = n (1)
11
Row-wise Striping, p = n (2)
  • Data partitioning: Pi has row i of A and element i of x
  • Communication: each processor broadcasts its element of x
  • Computation: each processor performs n multiplications and additions
  • Parallel run time: Tp = 2ntc + p(ts + tw) = 2ntc + n(ts + tw)
  • Algorithm is cost-optimal as both parallel and serial costs are O(n²)

12
Row-wise Striping, p < n
  • Data partitioning: each processor has n/p rows of A and the corresponding n/p elements of x
  • Communication: each processor broadcasts its elements of x
  • Computation: each processor performs n²/p multiplications and additions
  • Parallel run time: Tp = 2tcn²/p + pts + (n/p)tw
  • Algorithm is cost-optimal for p = O(n) (see the MPI sketch below)
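A minimal MPI sketch of row-wise striped MVM for p < n (added for illustration, not from the slides; the dimension n, the placeholder data, and the use of MPI_Allgather to realize the "each processor broadcasts its elements of x" step are assumptions):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, me;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        int n = 1024;            /* assumed dimension, divisible by p */
        int rows = n / p;        /* rows of A (and elements of x) per processor */

        double *Aloc = malloc(rows * n * sizeof(double));   /* my n/p rows of A */
        double *xloc = malloc(rows * sizeof(double));       /* my n/p elements of x */
        double *x    = malloc(n * sizeof(double));          /* full x after the broadcast */
        double *yloc = calloc(rows, sizeof(double));        /* my n/p elements of y */
        for (int k = 0; k < rows * n; k++) Aloc[k] = 1.0;   /* placeholder data */
        for (int k = 0; k < rows; k++)     xloc[k] = 1.0;

        /* Communication: all-to-all broadcast of the local pieces of x */
        MPI_Allgather(xloc, rows, MPI_DOUBLE, x, rows, MPI_DOUBLE, MPI_COMM_WORLD);

        /* Computation: n^2/p multiply-adds on the local rows */
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < n; j++)
                yloc[i] += Aloc[i * n + j] * x[j];

        MPI_Finalize();
        return 0;
    }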

13
Checkerboard Partitioning, p = n² (1)
14
Checkerboard Partitioning, p = n² (2)
  • Data partitioning: each processor has one element of A; only processors in the last column have one element of x
  • Communication:
  • One element of x sent from the last-column processor to the diagonal processor of its row
  • Broadcast from the diagonal processor to all processors in its column
  • Global sum of y from all processors in a row to the last processor
  • Computation: one multiplication and one addition
  • Parallel run time: Tp = 2tc + 3(ts + tw)
  • Algorithm is cost-optimal as both serial and parallel costs are O(n²)
  • For a bus network, the communication time is 3n(ts + tw); the system is not cost-optimal as the cost is O(n³)

15
Checkerboard Partitioning, p < n²
  • Data partitioning: each processor has an (n/√p) x (n/√p) block of A; processors in the last column have n/√p elements of x
  • Communication:
  • n/√p elements of x sent from the last-column processor to the diagonal processor of its row
  • Broadcast from the diagonal processor to all processors in its column
  • Global sum of y from all processors in a row to the last processor
  • Computation: n²/p multiplications and additions
  • Parallel run time: Tp = 2tcn²/p + 3(ts + (n/√p)tw)
  • Algorithm is cost-optimal only if p = O(n²) (a sketch of these steps follows)
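A minimal MPI sketch of the four checkerboard MVM steps above (added for illustration, not from the slides; the row-major mapping of ranks onto a √p x √p grid, the dimension n, and the placeholder data are assumptions):

    #include <mpi.h>
    #include <stdlib.h>
    #include <math.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, me;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        int q  = (int)sqrt((double)p);    /* p assumed to be a perfect square */
        int n  = 512;                     /* assumed dimension, divisible by q */
        int nb = n / q;                   /* n/sqrt(p) */
        int row = me / q, col = me % q;   /* my grid coordinates */

        MPI_Comm row_comm, col_comm;      /* rank in row_comm = col, in col_comm = row */
        MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

        double *Ablk  = malloc(nb * nb * sizeof(double));  /* my block of A */
        double *xseg  = malloc(nb * sizeof(double));       /* n/sqrt(p) elements of x */
        double *ypart = calloc(nb, sizeof(double));        /* partial sums for my row block */
        double *yseg  = malloc(nb * sizeof(double));       /* final y segment (last column only) */
        for (int k = 0; k < nb * nb; k++) Ablk[k] = 1.0;   /* placeholder data */
        for (int k = 0; k < nb; k++)      xseg[k] = 1.0;   /* meaningful only in the last column */

        /* 1. last-column processor of row i sends its x segment to the diagonal processor (i, i) */
        if (col == q - 1 && row != q - 1)
            MPI_Send(xseg, nb, MPI_DOUBLE, row, 0, row_comm);
        if (col == row && row != q - 1)
            MPI_Recv(xseg, nb, MPI_DOUBLE, q - 1, 0, row_comm, MPI_STATUS_IGNORE);

        /* 2. diagonal processor (j, j) broadcasts the segment to all processors in column j */
        MPI_Bcast(xseg, nb, MPI_DOUBLE, col, col_comm);

        /* 3. local computation: nb*nb = n^2/p multiply-adds */
        for (int i = 0; i < nb; i++)
            for (int j = 0; j < nb; j++)
                ypart[i] += Ablk[i * nb + j] * xseg[j];

        /* 4. global sum of the partial results along each row onto the last-column processor */
        MPI_Reduce(ypart, yseg, nb, MPI_DOUBLE, MPI_SUM, q - 1, row_comm);

        MPI_Finalize();
        return 0;
    }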

16
Matrix-Matrix Multiplication (MMM)
  • C = A x B, where A, B, and C are n x n square matrices
  • Block matrix multiplication: algebraic operations on sub-matrices, or blocks, of the matrices. This view of MMM aids parallelization.
  • do i = 0, q-1
  •   do j = 0, q-1
  •     do k = 0, q-1
  •       Ci,j = Ci,j + Ai,k x Bk,j
  •     end do
  •   end do
  • end do
  • Number of multiplications and additions = n³. Sequential run time = 2tcn³ (a serial sketch of the blocked loop follows)
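A minimal serial sketch of the blocked triple loop (added for illustration, not from the slides; the sizes N and Q are assumptions):

    #include <stdlib.h>

    #define N  256          /* assumed matrix dimension */
    #define Q  4            /* assumed number of blocks per dimension */
    #define NB (N / Q)      /* block dimension */

    /* One block operation of the slide's inner statement: C(i,j) = C(i,j) + A(i,k) x B(k,j). */
    static void block_multiply(const double *A, const double *B, double *C,
                               int bi, int bj, int bk) {
        for (int i = 0; i < NB; i++)
            for (int k = 0; k < NB; k++)
                for (int j = 0; j < NB; j++)
                    C[(bi*NB + i)*N + (bj*NB + j)] +=
                        A[(bi*NB + i)*N + (bk*NB + k)] * B[(bk*NB + k)*N + (bj*NB + j)];
    }

    int main(void) {
        double *A = malloc(N * N * sizeof(double));
        double *B = malloc(N * N * sizeof(double));
        double *C = calloc(N * N, sizeof(double));
        for (int k = 0; k < N * N; k++) { A[k] = 1.0; B[k] = 1.0; }   /* placeholder data */

        /* the triple loop over blocks from the slide: q^3 block multiplications */
        for (int i = 0; i < Q; i++)
            for (int j = 0; j < Q; j++)
                for (int k = 0; k < Q; k++)
                    block_multiply(A, B, C, i, j, k);

        free(A); free(B); free(C);
        return 0;
    }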

17
Checkerboard Partitioning, q = √p
  • Data partitioning: Pi,j has blocks Ai,j and Bi,j of A and B, each of dimension (n/√p) x (n/√p)
  • Communication: each processor broadcasts its submatrix Ai,j to all processors in its row; each processor broadcasts its submatrix Bi,j to all processors in its column
  • Computation: each processor performs n x (n/√p) x (n/√p) = n³/p multiplications and additions
  • Parallel run time: Tp = 2tcn³/p + 2√p(ts + (n²/p)tw)
  • Algorithm is cost-optimal only if p = O(n²)

18
Cannon's Algorithm (1)
  • Memory-efficient version of the checkerboard-partitioned block MMM
  • At any time, each processor has only one block of A and one block of B
  • Blocks are cycled after each computation in such a way that after √p computations the multiplication is complete for Ci,j
  • Initial distribution of the matrices is the same as in checkerboard partitioning
  • Communication:
  • Initial alignment: block Ai,j is moved left by i steps (with wraparound); block Bi,j is moved up by j steps (with wraparound)
  • Subsequent √p - 1 steps: block Ai,j is moved left by one step; block Bi,j is moved up by one step (both with wraparound)
  • After √p computation and communication steps the multiplication is complete for Ci,j (see the sketch below)
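A minimal MPI sketch of Cannon's algorithm (added for illustration, not from the slides; the dimension n, the placeholder data, and the use of a periodic Cartesian communicator for the wraparound shifts are assumptions):

    #include <mpi.h>
    #include <stdlib.h>
    #include <math.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, me;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        int q  = (int)sqrt((double)p);   /* sqrt(p) x sqrt(p) grid; p assumed a perfect square */
        int n  = 512;                    /* assumed dimension, divisible by q */
        int nb = n / q;                  /* block dimension n/sqrt(p) */

        int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2];   /* torus: shifts wrap around */
        MPI_Comm grid;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
        MPI_Comm_rank(grid, &me);
        MPI_Cart_coords(grid, me, 2, coords);
        int i = coords[0], j = coords[1];

        double *A = malloc(nb*nb*sizeof(double));
        double *B = malloc(nb*nb*sizeof(double));
        double *C = calloc(nb*nb, sizeof(double));
        for (int k = 0; k < nb*nb; k++) { A[k] = 1.0; B[k] = 1.0; }   /* placeholder data */

        int left, right, up, down, src, dst;
        MPI_Cart_shift(grid, 1, -1, &right, &left);   /* neighbours along my row */
        MPI_Cart_shift(grid, 0, -1, &down, &up);      /* neighbours along my column */

        /* initial alignment: A(i,j) moves left by i steps, B(i,j) moves up by j steps */
        MPI_Cart_shift(grid, 1, -i, &src, &dst);
        MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -j, &src, &dst);
        MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, dst, 1, src, 1, grid, MPI_STATUS_IGNORE);

        /* sqrt(p) multiply-and-shift steps */
        for (int step = 0; step < q; step++) {
            for (int r = 0; r < nb; r++)             /* C = C + A x B on the resident blocks */
                for (int k = 0; k < nb; k++)
                    for (int c = 0; c < nb; c++)
                        C[r*nb + c] += A[r*nb + k] * B[k*nb + c];
            MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, left, 0, right, 0, grid, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, up, 1, down, 1, grid, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }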

19
Cannon's Algorithm (2)
20
Cannon's Algorithm (3)
21
Cannon's Algorithm (4)
  • Communication:
  • √p point-to-point communications of size n²/p along rows
  • √p point-to-point communications of size n²/p along columns
  • Computation: over the √p steps, each processor performs n³/p multiplications and additions
  • Parallel run time: Tp = 2tcn³/p + 2√p(ts + (n²/p)tw)
  • Algorithm is cost-optimal if p = O(n²)

22
Fox's Algorithm (1)
  • Another memory-efficient version of the checkerboard-partitioned block MMM
  • Initial distribution of the matrices is the same as in checkerboard partitioning
  • At any time, each processor has only one block of A and one block of B
  • Steps (repeated √p times; see the sketch below):
  • 1. Broadcast Ai,i to all processors in the row
  • 2. Multiply the received block of A with the resident block of B
  • 3. Send the block of B up one step (with wraparound)
  • 4. Select block Ai,(j+1) mod √p (where Ai,j is the block broadcast in the previous step), broadcast it to all processors in the row, and go to step 2
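A minimal MPI sketch of Fox's algorithm (added for illustration, not from the slides; the dimension n, the placeholder data, and the row-major mapping of ranks onto a √p x √p grid are assumptions). The broadcast root in step t of row i is the processor holding block A(i, (i+t) mod √p), which is exactly the selection rule in step 4 above:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, me;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        int q  = (int)sqrt((double)p);   /* p assumed to be a perfect square */
        int n  = 512;                    /* assumed dimension, divisible by q */
        int nb = n / q;
        int i = me / q, j = me % q;      /* my grid coordinates */

        MPI_Comm row_comm, col_comm;     /* rank in row_comm = j, rank in col_comm = i */
        MPI_Comm_split(MPI_COMM_WORLD, i, j, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, j, i, &col_comm);

        double *A = malloc(nb*nb*sizeof(double)), *Atmp = malloc(nb*nb*sizeof(double));
        double *B = malloc(nb*nb*sizeof(double)), *C = calloc(nb*nb, sizeof(double));
        for (int k = 0; k < nb*nb; k++) { A[k] = 1.0; B[k] = 1.0; }   /* placeholder data */

        int up = (i - 1 + q) % q, down = (i + 1) % q;   /* neighbours in my column */

        for (int step = 0; step < q; step++) {
            int root = (i + step) % q;   /* column owning the A block broadcast in this step */
            if (j == root) memcpy(Atmp, A, nb*nb*sizeof(double));
            MPI_Bcast(Atmp, nb*nb, MPI_DOUBLE, root, row_comm);   /* steps 1 and 4 */

            for (int r = 0; r < nb; r++)   /* step 2: received A block times resident B block */
                for (int k = 0; k < nb; k++)
                    for (int c = 0; c < nb; c++)
                        C[r*nb + c] += Atmp[r*nb + k] * B[k*nb + c];

            /* step 3: send the resident B block up one step (with wraparound) */
            MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, up, 0, down, 0, col_comm, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }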

23
Fox's Algorithm (2)
24
Fox's Algorithm (3)
  • Communication:
  • √p broadcasts of size n²/p along rows
  • √p point-to-point communications of size n²/p along columns
  • Computation: each processor performs n³/p multiplications and additions
  • Parallel run time: Tp = 2tcn³/p + 2√p(ts + (n²/p)tw)
  • Algorithm is cost-optimal if p = O(n²)

25
Solving a System of Linear Equations
  • System of linear equations: Ax = b
  • A is a dense n x n matrix of coefficients
  • b is an n x 1 vector of right-hand-side (RHS) values
  • x is an n x 1 vector of unknowns
  • Solving for x is usually done in two stages:
  • First, Ax = b is reduced to Ux = y, where U is a unit upper triangular matrix: U(i, j) = 0 if i > j, otherwise U(i, j) ≠ 0, and U(i, i) = 1 for 0 ≤ i < n. This stage is called Gaussian elimination.
  • Second, the unknowns are solved in reverse order starting from x(n-1). This stage is called back-substitution.

26
Gaussian Elimination (1)
  • do k = 0, n-1
  •   do j = k+1, n-1
  •     A(k, j) = A(k, j)/A(k, k)
  •   end do
  •   y(k) = b(k)/A(k, k)
  •   A(k, k) = 1
  •   do i = k+1, n-1
  •     do j = k+1, n-1
  •       A(i, j) = A(i, j) - A(i, k)*A(k, j)
  •     end do
  •     b(i) = b(i) - A(i, k)*y(k)
  •     A(i, k) = 0
  •   end do
  • end do

27
Gaussian Elimination (2)
  • Computations:
  • Approximately n²/2 divisions
  • Approximately n³/3 - n²/2 multiplications and subtractions
  • Approximate sequential run time: Ts ≈ 2tcn³/3 (a worked check follows)

28
Striped Partitioning, p = n (1)
  • Data partitioning: each processor has one row of matrix A
  • Communication during iteration k (outermost loop):
  • Broadcast of the active part of the kth row (size n - k - 1) to processors k+1 to n-1
  • Computation during iteration k (outermost loop):
  • n - k - 1 divisions at processor Pk
  • n - k - 1 multiplications and subtractions at each processor Pi (k < i < n)
  • Parallel run time: Tp = (3/2)n(n-1)tc + nts + 0.5n(n-1)tw
  • Algorithm is cost-optimal since serial and parallel costs are O(n³) (see the MPI sketch below)
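A minimal MPI sketch of the synchronous row-striped elimination for p = n (added for illustration, not from the slides; one row per process, no pivoting, and the diagonally dominant placeholder data are assumptions):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int n, me;
        MPI_Comm_size(MPI_COMM_WORLD, &n);   /* p = n: process me owns row me of A */
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        double *row = malloc(n * sizeof(double));        /* my row of A */
        double *piv = malloc((n + 1) * sizeof(double));  /* active part of the pivot row plus y(k) */
        double b = 1.0, y = 0.0;
        for (int j = 0; j < n; j++) row[j] = (me == j) ? 2.0 : 1.0;   /* placeholder data */

        for (int k = 0; k < n; k++) {
            if (me == k) {
                for (int j = k + 1; j < n; j++) row[j] /= row[k];   /* division step */
                y = b / row[k];
                row[k] = 1.0;
                for (int j = k; j < n; j++) piv[j - k] = row[j];    /* pack active part of row k */
                piv[n - k] = y;
            }
            /* broadcast the active part of row k (size n-k) plus y(k) from process k */
            MPI_Bcast(piv, n - k + 1, MPI_DOUBLE, k, MPI_COMM_WORLD);
            if (me > k) {
                double aik = row[k];                                /* elimination step on my row */
                for (int j = k + 1; j < n; j++) row[j] -= aik * piv[j - k];
                b -= aik * piv[n - k];
                row[k] = 0.0;
            }
        }
        /* process i now holds row i of the unit upper triangular U, and y(i) in variable y */
        MPI_Finalize();
        return 0;
    }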

29
Striped Partitioning, p = n (2)
30
Striped Partitioning, p = n (3)
31
Pipelined Version (Striped Partitioning)
  • In the non-pipelined (synchronous) version, the outer loop over k is executed strictly in order.
  • When Pk is performing the division step, all other processors are idle
  • When the elimination step is performed, only processors k+1 to n-1 are active; the rest are idle
  • In the pipelined version, the division step, communication, and elimination step are overlapped.
  • Each processor communicates if it has data to communicate, computes if it has computations to be done, or waits if neither can be done (see the sketch below).
  • Cost-optimal for linear array, mesh, and hypercube interconnection networks that have directly-connected processors.
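A minimal sketch of the pipelined idea for the row-striped case (added for illustration, not from the slides; same data layout and assumptions as the synchronous sketch above). Instead of a broadcast, each process receives a pivot row from its predecessor, forwards it immediately, and only then eliminates, so the division, communication, and elimination of different outer-loop iterations overlap along the chain of processes:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int n, me;
        MPI_Comm_size(MPI_COMM_WORLD, &n);   /* p = n: process me owns row me of A */
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        double *row = malloc(n * sizeof(double));
        double *piv = malloc((n + 1) * sizeof(double));
        double b = 1.0, y = 0.0;
        for (int j = 0; j < n; j++) row[j] = (me == j) ? 2.0 : 1.0;   /* placeholder data */

        /* Pivot rows travel down the chain k -> k+1 -> ... -> n-1; a process forwards a
           pivot row before eliminating with it, so successive iterations are pipelined. */
        for (int k = 0; k <= me; k++) {
            if (me == k) {
                for (int j = k + 1; j < n; j++) row[j] /= row[k];   /* division step */
                y = b / row[k];
                row[k] = 1.0;
                for (int j = k; j < n; j++) piv[j - k] = row[j];
                piv[n - k] = y;
                if (me + 1 < n)
                    MPI_Send(piv, n - k + 1, MPI_DOUBLE, me + 1, k, MPI_COMM_WORLD);
            } else {                          /* me > k */
                MPI_Recv(piv, n - k + 1, MPI_DOUBLE, me - 1, k, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (me + 1 < n)               /* forward first, then compute */
                    MPI_Send(piv, n - k + 1, MPI_DOUBLE, me + 1, k, MPI_COMM_WORLD);
                double aik = row[k];
                for (int j = k + 1; j < n; j++) row[j] -= aik * piv[j - k];   /* elimination */
                b -= aik * piv[n - k];
                row[k] = 0.0;
            }
        }
        MPI_Finalize();
        return 0;
    }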

32
Pipelined Version (2)
33
Pipelined Version (3)
34
Striped Partitioning, p < n (1)
35
Striped Partitioning, p < n (2)
36
Checkerboard Partitioning, p = n² (1)
37
Checkerboard Partitioning, p = n² (2)
  • Data partitioning: Pi,j has element A(i, j) of matrix A
  • Communication during iteration k (outermost loop):
  • Broadcast of A(k, k) to processors (k, k+1) to (k, n-1) in the kth row
  • Broadcast of the modified A(i, k) along the ith row, for k < i < n
  • Broadcast of the modified A(k, j) along the jth column, for k < j < n
  • Computation during iteration k (outermost loop):
  • One division at Pk,k
  • One multiplication and subtraction at processors Pi,j (k < i, j < n)
  • Parallel run time: Tp = (3/2)n(n-1)tc + nts + 0.5(n-1)tw
  • Algorithm is cost-optimal since serial and parallel costs are O(n³)

38
Back-Substitution
  • Solution of Ux = y, where U is a unit upper triangular matrix
  • do k = n-1, 0, -1
  •   x(k) = y(k)
  •   do i = k-1, 0, -1
  •     y(i) = y(i) - x(k)*U(i, k)
  •   end do
  • end do
  • Computation: approximately n²/2 multiplications and subtractions
  • Parallel algorithm is similar to that for the Gaussian elimination stage