Title: Block LU Factorization Lecture 24
1Block LU FactorizationLecture 24
2Example Case
- Suppose we are faced with the solution of a
linear system Axb - Further suppose
- A is large (dim(A)gt10,000)
- A is dense
- A is full
- We have a sequence of different b vectors.
3Problems
- Suppose we are able to compute the matrix
- It costs N2 doubles to store the matrix
- E.g. for N100,000 we require 76.3 gigabytes of
storage for the matrix alone. - 32 bit processors are limited to 4 gigabytes of
memory - Most desktops (even 64 bit) do not have 76.3
gigabytes - What to do?
4Divide and Conquer
One approach is to assume we have a square number
of processors. We then divide the matrix into
blocks storing one block per processor.
P0
P1
P2
P3
P4
P5
P6
P7
P8
P9
P10
P11
P12
P13
P14
P15
5Back to the Linear System
- We are now faced with LU factorization of a
distributed matrix. - This calls for a modified LU routine which acts
on blocks of the matrix. - We will demonstrate this algorithm for one level.
- i.e. we need to construct matrices L,U such that
ALU and we only store single blocks of A,L,U on
any processor.
6Constructing the Block LU Factorization
A00
A01
A02
L00
0
0
U00
U01
U02
A10
A11
A12
L10
1
0
0
?11
?12
A20
A21
A22
L20
0
1
0
?21
?22
First we LU factorize A00 and look for the above
block factorization. However, we need to figure
out what each of the entries are
A00 L00U00 (compute by L00, U00 by LU
factorization) A01 L00U01 gt U01
L00\A01 A02 L00U02 gt U02 L00\A02 A10
L10U00 gt L10 A10/U00 A20 L20U00 gt L20
A20/U00 A11 L10U01 ?11 gt ?11 A11
L10U01..
7cont
A00 L00U00 (compute by L00, U00 by LU
factorization) A01 L00U01 gt U01
L00\A01 A02 L00U02 gt U02 L00\A02 A10
L10U00 gt L10 A10/U00 A20 L20U00 gt L20
A20/U00 A11 L10U01 ?11 gt ?11 A11
L10U01 A12 L10U02 ?12 gt ?12 A12
L10U02 A21 L20U01 ?21 gt ?21 A21
L20U01 A22 L20U02 ?22 gt ?22 A22
L20U02 In the general case Anm Ln0U0m ?nm
gt ?nm Anm Ln0U0m
8Summary First Stage
A00
A01
A02
L00
0
0
U00
U01
U02
A10
A11
A12
L10
1
0
0
?11
?12
A20
A21
A22
L20
0
1
0
?21
?22
First step LU factorize uppermost block
diagonal Second step a) compute U0n L00\A0n
ngt0 b) compute Ln0
An0/U00 ngt0 Third step compute ?nm Anm
Ln0U0m, (n,mgt0)
9Now Factorize Lower SE Block
?11
?12
L11
0
U11
U12
?21
?22
L21
1
0
??22
We repeat the previous algorithm this time on the
two by two SE block.
10End Result
A00
A01
A02
L00
0
0
U00
U01
U02
A10
A11
A12
L10
L11
0
0
U11
U12
A20
A21
A22
L20
L21
L22
0
0
U22
11Matlab Version
12Parallel Algorithm
P0
P1
P2
P0 A00 L00U00 (compute by L00, U00 by LU
factorization) P1 U01 L00\A01 P2 U02
L00\A02 P3 L10 A10/U00 P6 L20 A20/U00 P4
A11 lt- A11 L10U01 P5 A12 lt- A12
L10U02 P7 A21 lt- A21 L20U01 P8 A22 lt-
A22 L20U02 In the general case Anm Ln0U0m
?nm gt ?nm Anm Ln0U0m
P3
P4
P5
P6
P7
P8
13Parallel Communication
P0 L00,U00 lu(A) P1 U01 L00\A01 P2 U02
L00\A02 P3 L10 A10/U00 P6 L20 A20/U00 P4
A11 lt- A11 L10U01 P5 A12 lt- A12
L10U02 P7 A21 lt- A21 L20U01 P8 A22 lt-
A22 L20U02 In the general case Anm Ln0U0m
?nm gt ?nm Anm Ln0U0m
L00 U00
U01
U02
L10
A11
A12
L20
A21
A22
14Communication Summary
P0
P1
P2
L00 U00
U01
U02
P3
P4
P5
L10
A11
A12
P6
P7
P8
L20
A21
A22
P0 L00,U00 lu(A) P1 U01 L00\A01 P2 U02
L00\A02 P3 L10 A10/U00 P6 L20 A20/U00 P4
A11 lt- A11 L10U01 P5 A12 lt- A12
L10U02 P7 A21 lt- A21 L20U01 P8 A22 lt-
A22 L20U02
P0 sends L00 to P1,P2 sends U00 to
P3,P6 P1 sends U01 to P4,P7 P2 sends U02 to
P5,P8 P3 sends L10 to P4,P5 P4 sends L20 to
P7,P8
15Upshot
a
b
c
d
f
e
g
h
1st stage
1st stage
(a) P0 sends L00 to P1,P2 sends U00
to P3,P6 (b) P1 sends U01 to P4,P7 (c) P2
sends U02 to P5,P8 (d) P3 sends L10 to
P4,P5 (e) P4 sends L20 to P7,P8
(f) P4 sends L11 to P5 sends U11 to
P7 (g) P1 sends U12 to P8 (h) P3 sends L21 to
P8
- Notes
- I added an MPI_Barrier purely to separate the LU
factorization and the backsolve. - In terms of efficiency we can see that quite a
bit of time is spent in MPI_Wait compared to
compute time. - The compute part of this code can be optimized
much more making the parallelefficiency even
worse.
16Block Back Solve
- After factorization we are left with the task of
using the distributed L and U to compute the
backsolve
U00 L00
U01
U02
P0
P1
P2
L10
U11 L11
U12
P3
P4
P5
L20
L21
U22 L22
P6
P7
P8
Block distribution of L and U
17Recall
- Given an LU factorization of A namely, L,U such
that ALU - Then we can solve Axb by
- yL\b
- xU\y
18Distributed Back Solve
L00
0
0
y0 y1 y2
b0 b1 b2
L10
L11
0
L20
L21
L22
P0 solve L00y0 b0
send y0 to P3,P6 P3
send L10y0
to P4 P4 solve L11y1 b1-L10y0
send y1 to P7 P6
send L20y0 to
P8\ P7
send L21y1 to P8 P8 solve
L22y2 b2-L20y0-L21y1 Results y0 on P0, y1
on P4, y2 on P8
P0
P1
P2
P3
P4
P5
P6
P7
P8
19Matlab Code
20Back Solve
After the factorization we computed a solution to
Axb
This consists of two distributed block triangular
systems to solve
21Barrier Between Back Solves
This time I inserted an MPI_Barrier call between
the backsolves. This highlights the serial
nature of the backsolves..
22Example Code
- http//www.math.unm.edu/timwar/MA471F03/blocklu.m
- http//www.math.unm.edu/timwar/MA471F03/parlufact
2.c