Dense Linear Algebra (Data Distributions) - PowerPoint PPT Presentation

1 / 19
About This Presentation
Title:

Dense Linear Algebra (Data Distributions)

Description:

Dense Linear Algebra (Data Distributions) Sathish Vadhiyar Gaussian Elimination - Review Version 1 for each column i zero it out below the diagonal by adding ... – PowerPoint PPT presentation

Number of Views:55
Avg rating:3.0/5.0
Slides: 20
Provided by: SathishV4
Category:

less

Transcript and Presenter's Notes

Title: Dense Linear Algebra (Data Distributions)


1
Dense Linear Algebra(Data Distributions)
  • Sathish Vadhiyar

2
Gaussian Elimination - Review
  • Version 1
  • for each column i
  • zero it out below the diagonal by adding
    multiples of row i to later rows
  • for i 1 to n-1
  • for each row j below row i
  • for j i1 to n
  • add a multiple of row i to row j
  • for k i to n
  • A(j, k) A(j, k) A(j, i)/A(i, i) A(i,
    k)

i
0 0 0 0 0 0
k
0 0 0 0 0
i
i,i X X x
j
3
Gaussian Elimination - Review
  • Version 2 Remove A(j, i)/A(i, i) from inner
    loop
  • for each column i
  • zero it out below the diagonal by adding
    multiples of row i to later rows
  • for i 1 to n-1
  • for each row j below row i
  • for j i1 to n
  • m A(j, i) / A(i, i)
  • for k i to n
  • A(j, k) A(j, k) m A(i, k)

i
0 0 0 0 0 0
k
0 0 0 0 0
i
i,i X X x
j
4
Gaussian Elimination - Review
  • Version 3 Dont compute what we already know
  • for each column i
  • zero it out below the diagonal by adding
    multiples of row i to later rows
  • for i 1 to n-1
  • for each row j below row i
  • for j i1 to n
  • m A(j, i) / A(i, i)
  • for k i1 to n
  • A(j, k) A(j, k) m A(i, k)

i
0 0 0 0 0 0
k
0 0 0 0 0
i
i,i X X x
j
5
Gaussian Elimination - Review
  • Version 4 Store multipliers m below diagonals
  • for each column i
  • zero it out below the diagonal by adding
    multiples of row i to later rows
  • for i 1 to n-1
  • for each row j below row i
  • for j i1 to n
  • A(j, i) A(j, i) / A(i, i)
  • for k i1 to n
  • A(j, k) A(j, k) A(j, i) A(i, k)

i
0 0 0 0 0 0
k
0 0 0 0 0
i
i,i X X x
j
6
GE - Runtime
  • Divisions
  • Multiplications / subtractions
  • Total

1 2 3 (n-1) n2/2 (approx.)
12 22 32 42 52 . (n-1)2 n3/3 n2/2
2n3/3
7
Parallel GE
  • 1st step 1-D block partitioning along blocks of
    n columns by p processors

i
0 0 0 0 0 0
k
0 0 0 0 0
i
i,i X X x
j
8
1D block partitioning - Steps
  • 1. Divisions

n2/2
2. Broadcast
xlog(p) ylog(p-1) zlog(p-3) log1 lt n2logp
3. Multiplications and Subtractions
(n-1)n/p (n-2)n/p . 1x1 n3/p (approx.)
Runtime
lt n2/2 n2logp n3/p
9
2-D block
  • To speedup the divisions

P
i
0 0 0 0 0 0
k
0 0 0 0 0
i
Q
i,i X X x
j
10
2D block partitioning - Steps
  • 1. Broadcast of (k,k)

logQ
2. Divisions
n2/Q (approx.)
3. Broadcast of multipliers
xlog(P) ylog(P-1) zlog(P-2) . n2/Q logP
4. Multiplications and subtractions
n3/PQ (approx.)
11
Problem with block partitioning for GE
  • Once a block is finished, the corresponding
    processor remains idle for the rest of the
    execution
  • Solution? -

12
Onto cyclic
  • The block partitioning algorithms waste processor
    cycles. No load balancing throughout the
    algorithm.
  • Onto cyclic

0
1
2
3
0
2
3
0
2
3
1
1
0
1-D block-cyclic
2-D block-cyclic
cyclic
Load balance, block operations, but column
factorization bottleneck
Has everything
Load balance
13
Block cyclic
  • Having blocks in a processor can lead to
    block-based operations (block matrix multiply
    etc.)
  • Block based operations lead to high performance
  • Operations can be split into 3 categories based
    on number of operations per memory reference
  • Referred to BLAS levels

14
Basic Linear Algebra Subroutines (BLAS) 3
levels of operations
  • Memory hierarchy efficiently exploited by higher
    level BLAS

BLAS Memory Refs. Flops Flops/Memory refs.
Level-1 (vector) yyax Zy.x 3n 2n 2/3
Level-2 (Matrix-vector) yyAx A A(alpha) xyT n2 2n2 2
Level-3 (Matrix-Matrix) CCAB 4n2 2n3 n/2
15
Gaussian Elimination - Review
i
  • Version 4 Store multipliers m below diagonals
  • for each column i
  • zero it out below the diagonal by adding
    multiples of row i to later rows
  • for i 1 to n-1
  • for each row j below row i
  • for j i1 to n
  • A(j, i) A(j, i) / A(i, i)
  • for k i1 to n
  • A(j, k) A(j, k) A(j, i) A(i, k)

0 0 0 0 0 0
k
0 0 0 0 0
i
i,i X X x
j
What GE really computes for i1 to n-1
A(i1n, i) A(i1n, i) / A(i, i) A(i1n,
i1n) - A(i1n, i)A(i, i1n)
i
Finished multipliers
A(i,i)
A(i,k)
i
A(i, i1n)
Finished multipliers
A(j,i)
A(j,k)
BLAS1 and BLAS2 operations
A(i1n, i)
A(i1n, i1n)
16
Converting BLAS2 to BLAS3
  • Use blocking for optimized matrix-multiplies
    (BLAS3)
  • Matrix multiplies by delayed updates
  • Save several updates to trailing matrices
  • Apply several updates in the form of matrix
    multiply

17
Modified GE using BLAS3Courtesy Dr. Jack
Dongarra
  • for ib 1 to n-1 step b / process matrix b
    columns at a time /
  • end ibb-1
  • / Apply BLAS 2 version of GE to get A(ibn,
    ibend) factored.
  • Let LL denote the strictly lower triangular
    portion of A(ibend, ibend)1 /
  • A(ibend, end1n) LL-1A(ibend, end1n) /
    update next b rows of U /
  • A(end1n, end1n) A(end1n, end1n) -
    A(ibn, ibend) A(ibend, end1n)
  • / Apply delayed updates with single matrix
    multiply /

b
ib
end
Completed part of U
A(ibend, ibend)
ib
Completed part of L
b
end
A(end1n, end1n)
A(end1n, ibend)
18
GE MiscellaneousGE with Partial Pivoting
  • 1D block-column partitioning which is better?
    Column or row pivoting
  • 2D block partitioning Can restrict the pivot
    search to limited number of columns
  • Column pivoting does not involve any extra steps
    since pivot search and exchange are done locally
    on each processor. O(n-i-1)
  • The exchange information is passed to the other
    processes by piggybacking with the multiplier
    information
  • Row pivoting
  • Involves distributed search and exchange
    O(n/P)O(logP)

19
Triangular Solve Unit Upper triangular matrix
  • Sequential Complexity -
  • Complexity of parallel algorithm with 2D block
    partitioning (P0.5P0.5)

O(n2)
O(n2)/P0.5
  • Thus (parallel GE / parallel TS) lt (sequential
    GE / sequential TS)
  • Overall (GETS) O(n3/P)
Write a Comment
User Comments (0)
About PowerShow.com