Title: Sparse LA
1Sparse LA
2Motivation
- Sparse computations are much more challenging than dense ones due to complex data structures and irregular memory references
- Many physical systems produce sparse matrices
- Most of the research, and the base case here, is in sparse symmetric positive definite matrices
3Sparse Cholesky
- To solve Ax = b
- Factor A = LL^T, then solve Ly = b followed by L^T x = y
- Cholesky factorization introduces fill-in
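As a minimal illustration of this solve pattern, here is a dense NumPy/SciPy sketch (the sparse case follows the same three steps but stores only the nonzeros; the matrix and right-hand side below are made-up example values):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

# small SPD test matrix and right-hand side (illustrative values)
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

L = cholesky(A, lower=True)                 # A = L L^T
y = solve_triangular(L, b, lower=True)      # forward solve  L y = b
x = solve_triangular(L.T, y, lower=False)   # backward solve L^T x = y

assert np.allclose(A @ x, b)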
4Column oriented left-looking Cholesky
5Fill-in
Fill: new nonzeros created in the factor
6Permutation Matrix or Ordering
- Thus, ordering is used to reduce fill or to enhance numerical stability
- Choose a permutation matrix P so that the Cholesky factor L' of PAP^T has less fill than L
- Triangular solve: L'y = Pb, L'^T z = y, x = P^T z
- The fill can be predicted in advance
- A static data structure can be used - symbolic factorization
7Steps
- Ordering: find a permutation P of matrix A
- Symbolic factorization: set up a data structure for the Cholesky factor L of PAP^T
- Numerical factorization: decompose PAP^T into LL^T
- Triangular system solution: Ly = Pb, L^T z = y, x = P^T z
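A sketch of these four steps with SciPy, under two stated substitutions: reverse Cuthill-McKee is used as a stand-in for a fill-reducing ordering, and an LU factorization stands in for Cholesky (scipy.sparse.linalg has no sparse Cholesky); the permute / factor / solve / un-permute pattern is the point:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.sparse.linalg import splu

n = 8
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")  # 1-D Laplacian
b = np.ones(n)

# Step 1: ordering (RCM here as a stand-in for a fill-reducing ordering)
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
P = sp.eye(n, format="csr")[perm]            # permutation matrix P
Ap = (P @ A @ P.T).tocsc()                   # P A P^T

# Steps 2-3: symbolic + numerical factorization (LU here; permc_spec="NATURAL"
# keeps our own ordering instead of letting SuperLU reorder again)
lu = splu(Ap, permc_spec="NATURAL")

# Step 4: triangular solves on the permuted system, then un-permute
x = P.T @ lu.solve(P @ b)
assert np.allclose(A @ x, b)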
8Sparse Matrices and Graph Theory
[Figure: a 7x7 sparse symmetric matrix and its graph G(A) on vertices 1-7]
9Sparse and Graph
[Figure: the filled matrix F and its graph F(A), including the fill edges created during factorization]
10Ordering
- The order of elimination used above is the natural order
- The first heuristic is minimum degree ordering
- Simple and effective
- But its effectiveness depends on the tie-breaking strategy
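A minimal sketch of the basic minimum degree idea on an adjacency-set graph (pure Python; the example graph and the label-based tie-breaking are assumptions for illustration - real implementations use quotient graphs, mass elimination, and careful tie breaking):

def minimum_degree_ordering(adj):
    """adj: dict {vertex: set(neighbors)}; returns an elimination order."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order = []
    while adj:
        # pick a vertex of minimum current degree (ties broken by label)
        v = min(adj, key=lambda u: (len(adj[u]), u))
        nbrs = adj.pop(v)
        # eliminating v turns its neighborhood into a clique
        # (this is exactly where fill edges would be created)
        for u in nbrs:
            adj[u] |= nbrs - {u}
            adj[u].discard(v)
        order.append(v)
    return order

# small example graph (not the one from the slides)
g = {1: {3, 6}, 2: {5}, 3: {1, 6}, 4: {7}, 5: {2, 7}, 6: {1, 3}, 7: {4, 5}}
print(minimum_degree_ordering(g))   # -> [2, 4, 5, 7, 1, 3, 6] with this tie-breaking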
11Minimum degree ordering for the previous Matrix
[Figure: minimum degree ordering applied to the 7-vertex example graph. The ordering 2, 4, 5, 7, 3, 1, 6 gives no fill-in!]
12Ordering
- Another ordering is nested dissection (divide-and-conquer)
- Find a separator S of nodes whose removal (along with incident edges) divides the graph into 2 disjoint pieces
- Variables in each piece are numbered contiguously, and variables in S are numbered last
- Leads to a bordered block diagonal non-zero pattern
- Can be applied recursively
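A sketch of the recursive numbering scheme, assuming a user-supplied find_separator(vertices, adj) that returns (part1, part2, separator); how the separator is actually found (BFS level structures, partitioners such as K-L) is discussed later in the deck:

def nested_dissection(vertices, adj, find_separator, order=None):
    """Append `vertices` to `order` so that separator vertices are numbered last."""
    if order is None:
        order = []
    if len(vertices) <= 2:
        order.extend(vertices)            # small piece: number directly
        return order
    part1, part2, sep = find_separator(vertices, adj)
    if not part1 or not part2:            # could not split further
        order.extend(vertices)
        return order
    nested_dissection(part1, adj, find_separator, order)   # piece 1 numbered first
    nested_dissection(part2, adj, find_separator, order)   # then piece 2
    order.extend(sep)                     # separator S numbered last
    return order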
13Nested Dissection Illustration
[Figure: a 5x5 grid graph with vertices numbered 1-25 in natural row-by-row order; the middle column forms the separator S]
14Nested Dissection Illustration
[Figure: the same 5x5 grid renumbered by nested dissection; the separator vertices are numbered last, 21-25]
15Symbolic factorization
- Can simulate the numerical factorization
- Struct(M_i*) = { k < i : m_ik ≠ 0 }, the structure of row i
- Struct(M_*j) = { k > j : m_kj ≠ 0 }, the structure of column j
- p(j) = min { i ∈ Struct(L_*j) } if Struct(L_*j) ≠ ∅, p(j) = j otherwise
- Struct(L_*j) ⊆ Struct(L_*p(j)) ∪ { p(j) }
- Struct(L_*j) = Struct(A_*j) ∪ ( ∪ { Struct(L_*i) : i < j, p(i) = j } ) \ { j }
16Symbolic Factorization
- for j = 1 to n do
-   R_j = ∅
- for j = 1 to n do
-   S = Struct(A_*j)
-   for i ∈ R_j do
-     S = S ∪ Struct(L_*i) \ { j }
-   Struct(L_*j) = S
-   if Struct(L_*j) ≠ ∅ then
-     p(j) = min { i ∈ Struct(L_*j) }
-     R_p(j) = R_p(j) ∪ { j }
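A direct Python transcription of this pseudocode (0-based columns; struct_A[j] is an assumed input format holding the below-diagonal row indices of column j of A):

def symbolic_factorization(struct_A, n):
    """Return (struct_L, parent): the column structures of L and the
    elimination-tree parent of each column."""
    struct_L = [set() for _ in range(n)]
    parent = [None] * n
    R = [set() for _ in range(n)]          # R[j]: columns whose parent is j
    for j in range(n):
        S = set(struct_A[j])               # Struct(A_*j), indices > j
        for i in R[j]:
            S |= struct_L[i] - {j}         # merge the structures of the children
        struct_L[j] = S
        if S:
            parent[j] = min(S)             # p(j) = min Struct(L_*j)
            R[parent[j]].add(j)
    return struct_L, parent

# example: arrow pattern pointing down produces no fill
struct_A = [{4}, {4}, {4}, {4}, set()]
print(symbolic_factorization(struct_A, 5))
# -> ([{4}, {4}, {4}, {4}, set()], [4, 4, 4, 4, None])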
17Numerical Factorization
cmod(j, k): modification of column j by column k, k < j
cdiv(j): division of column j by a scalar
18Algorithms
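A dense, column-oriented left-looking Cholesky written in terms of the cmod and cdiv operations defined above (a sketch for illustration; the sparse version applies cmod(j, k) only for the columns k that actually have a nonzero in row j):

import numpy as np

def cmod(L, j, k):
    """Modify column j by column k (k < j): L[j:, j] -= L[j, k] * L[j:, k]."""
    L[j:, j] -= L[j, k] * L[j:, k]

def cdiv(L, j):
    """Scale column j by the square root of its diagonal entry."""
    L[j, j] = np.sqrt(L[j, j])
    L[j+1:, j] /= L[j, j]

def left_looking_cholesky(A):
    n = A.shape[0]
    L = np.tril(A).astype(float)       # start from the lower triangle of A
    for j in range(n):
        for k in range(j):             # left-looking: apply all finished columns
            cmod(L, j, k)
        cdiv(L, j)
    return L

A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
L = left_looking_cholesky(A)
assert np.allclose(L @ L.T, A)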
19Elimination Tree
- T(A) has an edge between two vertices i and j, with i > j, if i = p(j); i is the parent of j.
[Figure: example elimination tree on vertices 1-10, with vertex 10 at the root]
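The deck later notes (slide 50) that the elimination tree can be computed efficiently from the structure of A alone. A sketch of the ancestor/path-compression approach (commonly attributed to Liu) follows; struct_A_rows is an assumed input format, giving for each row i the columns k < i with a_ik ≠ 0:

def elimination_tree(struct_A_rows, n):
    """Compute parent[] of T(A) directly from the nonzero structure of A."""
    parent = [None] * n
    ancestor = [None] * n            # path-compressed ancestor links
    for i in range(n):
        for k in struct_A_rows[i]:   # walk up from each nonzero a_ik, k < i
            while ancestor[k] is not None and ancestor[k] != i:
                nxt = ancestor[k]
                ancestor[k] = i      # path compression
                k = nxt
            if ancestor[k] is None:
                ancestor[k] = i
                parent[k] = i        # i becomes the parent of the root reached
    return parent

# example: tridiagonal matrix gives a chain 0 -> 1 -> 2 -> 3
print(elimination_tree([[], [0], [1], [2]], 4))   # -> [1, 2, 3, None]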
20Supernode
- A set of contiguous columns in the Cholesky factor L that share essentially the same sparsity structure.
- The set of contiguous columns { j, j+1, ..., j+t } constitutes a supernode if
- Struct(L_*k) = Struct(L_*(k+1)) ∪ { k+1 } for j ≤ k ≤ j+t-1
- Columns in the same supernode can be treated as a unit
- Used for enhancing the efficiency of minimum degree ordering and symbolic factorization.
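A sketch that detects supernodes from the column structures of L (for example, as produced by the symbolic factorization sketch above), applying the definition on this slide directly; the example structure is made up:

def find_supernodes(struct_L, n):
    """Group contiguous columns into supernodes where
    Struct(L_*k) = Struct(L_*(k+1)) ∪ {k+1} for consecutive columns."""
    supernodes, start = [], 0
    for k in range(n - 1):
        if struct_L[k] != (struct_L[k + 1] | {k + 1}):
            supernodes.append(list(range(start, k + 1)))
            start = k + 1
    supernodes.append(list(range(start, n)))
    return supernodes

# example: the dense trailing block of 3 columns forms one supernode
struct_L = [{3}, {4}, {3, 4}, {4}, set()]
print(find_supernodes(struct_L, 5))   # -> [[0], [1], [2, 3, 4]]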
21Parallelization of Sparse Cholesky
- Most of the parallel algorithms are based on elimination trees
- Work associated with two disjoint subtrees can proceed independently
- The same steps as in sequential sparse factorization
- One additional step: assignment of tasks to processors
22Ordering
- Two issues:
- ordering in parallel
- finding an ordering that will help parallelization in the subsequent steps
23Ordering in Parallel Nested dissection
- Nested dissection can be carried out in parallel
- It also leads to elimination trees that can be parallelized during the subsequent factorizations
- But parallelism exists only in the later levels of dissection
- Can be applied to only a limited class of problems
- More later.
24Ordering for Parallel Factorization
- There is no agreed objective for ordering for parallel factorization: not all orderings that reduce fill-in provide scope for parallelization
[Figure: a 7x7 example matrix in natural order and its elimination tree. No fill, but no scope for parallelization.]
25Example (Contd..)
[Figure: the same example under a nested dissection order and its elimination tree. Some fill, but scope for parallelization.]
26Ordering for parallel factorization Tree
restructuring
- Decouple the fill-reducing ordering from the ordering for parallel elimination
- Determine a fill-reducing ordering P of G(A)
- Form the elimination tree T(PAP^T)
- Transform this tree T(PAP^T) into one with smaller height and record the corresponding equivalent reordering P'
27Ordering for parallel factorization Tree
restructuring
- Efficiency depends on whether such an equivalent reordering can be found
- Also on the limitations of the initial ordering P
- Only minor modifications are made to the initial ordering, hence only limited improvement in parallelism
- Algorithm by Liu (1989) based on elimination tree rotation to reduce the height
- Algorithm by Jess and Kees (1982) based on chordal graphs to reduce the height
28Chordal graph and equivalent ordering
- A graph is chordal if every cycle of length 4 or more has a chord, i.e. an edge connecting two non-consecutive vertices of the cycle
- Non-deficient, zero-deficient or simplicial node: a node in a graph whose adjacent set is a clique.
29Properties of Chordal Graphs, Some Preliminaries
- The filled graph G(F) corresponding to the filled matrix F of A is chordal
- A chordal graph has at least one perfect elimination ordering (no fill)
30Chordal graph and equivalent ordering
- Given: matrix A, an ordered matrix A' = PAP^T, the fill matrix F, the elimination tree T(A'), and the graphs G(A), G(F)
- To solve: find a reordering P_b equivalent to P such that the fill matrix F_b corresponding to A_b = P_b A P_b^T has no fill-in beyond F, and the elimination tree T(A_b) has minimum height (or is well suited to parallel elimination)
- or, equivalently: determine a perfect elimination ordering P_b of the chordal graph G(F) such that the elimination tree T(A_b) has minimum height (or is well suited to parallel elimination)
31Properties of Chordal Graphs, Some Preliminaries
- A chordal graph has at least one simplicial node.
- If the graph is not a clique, there are at least two independent (non-adjacent) simplicial nodes.
- The subgraph of simplicial (zero-deficiency) nodes consists of disconnected cliques.
32Perfect Ordering
- Elimination of a simplicial node has been shown to create no filled edge
- Thus, to find a perfect ordering, keep finding and eliminating simplicial nodes at each step
33Jess and Kees- Algorithm
Parallel_Elimination(G)
begin
  G_0 = G; i = 0
  while G_i ≠ ∅ do
  begin
    R_i = a maximum independent subset of non-deficient nodes
    order the nodes in R_i next
    G_(i+1) = G_i \ R_i (eliminate the nodes of R_i from G_i)
    i = i + 1
  end
end.
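A pure-Python sketch of this loop on an adjacency-set graph: at each round take a maximal independent set of simplicial (non-deficient) nodes, order them next, and eliminate them. It assumes the graph is chordal (e.g., a filled graph G(F)), so simplicial nodes always exist; the greedy maximal independent set and the tiny example are illustrative choices:

def is_simplicial(v, adj):
    """v is simplicial if its neighborhood is a clique."""
    nbrs = adj[v]
    return all(adj[u] >= (nbrs - {u}) for u in nbrs)

def jess_kees_order(adj):
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    rounds = []
    while adj:
        candidates = [v for v in adj if is_simplicial(v, adj)]
        R, taken = [], set()
        for v in sorted(candidates):          # greedy maximal independent subset
            if not (adj[v] & taken):
                R.append(v)
                taken.add(v)
        for v in R:                           # eliminate the selected nodes (no fill)
            for u in adj[v]:
                adj[u].discard(v)
            del adj[v]
        rounds.append(R)
    return rounds

# example: a path a-b-c is chordal; the two endpoints are simplicial first
print(jess_kees_order({'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}))
# -> [['a', 'c'], ['b']]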
34Jess and Kees - Example
[Figure: an 8-vertex example graph on nodes a-h, with the elimination rounds R0-R3 marked]
- Step i: Selected R_i
- 0: a, c, f, h
- 1: b, d
- 2: g
- 3: e
35Jess and Kees Algorithm Minimum Height Property
- This algorithm produces an elimination tree of minimum height h among all perfect reorderings of the filled matrix F.
- Can be proven using induction:
- Considering the elimination tree without the leaf nodes, assume that the reduced tree satisfies the minimum height property with height h-1.
- Considering the full elimination tree, prove that the addition of the leaf nodes doesn't change the minimum height property and yields h.
36Height and Parallel Completion time
- Not all elimination trees with minimum height give rise to small parallel completion times.
- Let each node v in the elimination tree be associated with (x, y):
- x = time[v], the time for factorization of column v
- y = level[v]
- level[v] = time[v] if v is the root of the elimination tree, time[v] + level[parent of v] otherwise
- level[v] represents the minimum time to completion starting at node v
- Parallel completion time = maximum level value among all nodes
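A sketch computing level[v] and the parallel completion time from a parent array and per-column factorization times, following the definition above (the input format and example values are assumptions):

def completion_levels(parent, time):
    """parent[v]: parent of v in the elimination tree (None for the root);
    time[v]: factorization time of column v. Returns (level, completion time)."""
    n = len(parent)
    level = [None] * n

    def lev(v):
        if level[v] is None:
            level[v] = time[v] if parent[v] is None else time[v] + lev(parent[v])
        return level[v]

    for v in range(n):
        lev(v)
    return level, max(level)     # parallel completion time = max level value

# chain 2 -> 1 -> 0 with root 0: levels accumulate along the only path
print(completion_levels([None, 0, 1], [5, 3, 2]))   # -> ([5, 8, 10], 10)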
37Height and Parallel Completion time
[Figure: two elimination trees on vertices a-i, each node annotated with its (time, level) pair; a minimum-height tree need not give the minimum parallel completion time]
38Minimization of Cost
- Thus some recent algorithms (Weng-Yang Lin, J. of Supercomputing, 2003) pick, at each step, the nodes with the minimum cost (a greedy approach)
39Nested Dissection Algorithms
- Use a graph partitioning heuristic to obtain a small edge separator of the graph
- Transform the small edge separator into a small node separator
- Number the nodes of the separator last and apply recursively
40Algorithm 1 - Level Structures
- d(x, y): distance between x and y
- Eccentricity: e(x) = max over y in X of d(x, y)
- Diameter: d(G) = maximum of the eccentricities
- Peripheral node: a node x in X whose e(x) = d(G)
- Level structure: a partitioning L = { L0, ..., Ll } such that Adj(Li) ⊆ L(i-1) ∪ L(i+1)
41Example
[Figure: an 8-vertex example graph and its level structure rooted at node 6]
42Breadth First Search
- One way of finding level structures is by BFS starting from a peripheral node
- Finding a peripheral node is expensive; hence settle for a pseudo-peripheral node
43Pseudo peripheral node
- 1. Pick an arbitrary node r in X
- 2. Generate a level structure rooted at r with e(r) levels
- 3. Choose a node x in the last level with minimum degree
- 4. Generate a level structure rooted at x
- 5. If e(x) > e(r), set r = x and go to step 3; else x is the pseudo-peripheral node
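A sketch of both pieces on an adjacency-set graph: BFS level structures and the pseudo-peripheral search described above (a connected graph is assumed; the function names are illustrative):

def level_structure(root, adj):
    """Return the BFS level structure [L0, L1, ...] rooted at `root`."""
    levels, seen, frontier = [], {root}, [root]
    while frontier:
        levels.append(frontier)
        nxt = []
        for v in frontier:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    return levels

def pseudo_peripheral(adj):
    """Search for a pseudo-peripheral node in the style described above."""
    r = next(iter(adj))                                  # 1. arbitrary start node
    levels = level_structure(r, adj)                     # 2. level structure at r
    while True:
        x = min(levels[-1], key=lambda v: len(adj[v]))   # 3. min degree in last level
        levels_x = level_structure(x, adj)               # 4. level structure at x
        if len(levels_x) > len(levels):                  # 5. eccentricity grew: repeat
            r, levels = x, levels_x
        else:
            return x, levels_x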
[Figure: successive level structures rooted at candidate nodes r during the pseudo-peripheral search]
44ND Heuristic based on BFS
- Construct a level structure with l levels
- Form the separator S from the nodes in level (l+1)/2
- Apply recursively
47Example
[Figure: example of turning an edge separator between parts A and B into a node separator]
49K-L for ND
- Form a random initial partition
- Form an edge separator by applying K-L to form partitions P1 and P2
- Let V1 be the nodes in P1 incident on at least one edge in the separator set; similarly V2
- V1 ∪ V2 is a wide node separator
- V1 or V2 alone is a narrow node separator (Gilbert and Zmijewski, 1987)
50Step 2: Mapping Problems onto Processors
- Based on elimination trees
- But elimination trees are determined from the structure of L, which is computed in symbolic factorization (step 3) - a bootstrapping problem!
- Efficient algorithms exist to find elimination trees from the structure of A.
- Parallel calculation of the elimination tree by Zmijewski and Gilbert: each processor computes a local version of the elimination tree and then the local versions are combined.
- Various strategies exist to map columns to processors based on elimination trees:
- Strategy 1: successive levels in the elimination tree are wrap-mapped onto processors
- Strategy 2: subtree-to-subcube
- Strategy 3: bin-pack by Geist and Ng
51Strategy 1
[Figure: an elimination tree whose successive levels are wrap-mapped onto processors 0-3]
52Strategy 2 Subtree-to-subcube mapping
- Select an appropriate set of P subtrees of the elimination tree, say T0, T1, ...
- Assign the columns corresponding to Ti to Pi
- Where two subtrees merge into a single subtree, their processor sets are merged together and wrap-mapped onto the nodes/columns of the separator that begins at that point.
- The root separator is wrap-mapped onto the set of all processors.
53Strategy 2
[Figure: subtree-to-subcube mapping of an elimination tree onto processors 0-3]
54Strategy 3 Bin-Pack (Geist and Ng)
- Try to find disjoint subtrees
- Map the subtrees to p bins based on a first-fit-decreasing bin-packing heuristic
- Subtrees are processed in decreasing order of workload
- A subtree is packed into the currently lightest bin
- Weight imbalance a: the ratio between the lightest and heaviest bins
- If a > a user-specified tolerance, stop
- Else take the heaviest subtree, split it into its subtrees, and repack the p bins using bin-packing again
- Repeat until a > the tolerance or the largest subtree cannot be split further
- Load balance is thus controlled by the user-specified tolerance
- The remaining nodes, from the roots of the subtrees to the root of the tree, are wrap-mapped.
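A sketch of the packing core of this strategy: subtrees (represented here simply as (id, workload) pairs, an assumed format) are taken in decreasing order of workload and placed into the currently lightest of p bins; the splitting-and-repacking loop driven by the tolerance is left to the caller:

def pack_subtrees(subtrees, p):
    """subtrees: list of (subtree_id, workload). Returns (bins, imbalance a)."""
    bins = [[] for _ in range(p)]
    load = [0.0] * p
    for tid, work in sorted(subtrees, key=lambda s: -s[1]):  # decreasing workload
        i = min(range(p), key=lambda b: load[b])              # currently lightest bin
        bins[i].append(tid)
        load[i] += work
    imbalance = min(load) / max(load)                         # ratio lightest/heaviest
    return bins, imbalance

bins, a = pack_subtrees([("T0", 9), ("T1", 7), ("T2", 4), ("T3", 4), ("T4", 2)], 2)
print(bins, a)   # if a is below the tolerance, split the heaviest subtree and repack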
55Parallel symbolic factorization
- Sequential symbolic factorization is very efficient
- Parallel versions do not achieve good speedups: limited parallelism, small task sizes, high communication overhead
- Mapping strategies typically wrap-map processors to the columns of the same supernode, hence more storage and work than the sequential version
- With a supernodal structure, only the processor holding the 1st column of a supernode calculates the structure
- The other processors holding columns of that supernode simply retrieve the structure from the processor holding the first column.
56Parallel Numerical Factorization Submatrix
Cholesky
Tsub(k) is partitioned into various subtasks Tsub(k,1), ..., Tsub(k,P), where Tsub(k,p) = { cmod(j,k) : j ∈ Struct(L_*k) ∩ mycols(p) }
57Definitions
- mycols(p): the set of columns owned by p
- map[k]: the processor containing column k
- procs(L_*k) = { map[j] : j ∈ Struct(L_*k) }
58Parallel Submatrix Cholesky
- for j in mycols(p) do
-   if j is a leaf node in T(A) do
-     cdiv(j)
-     send L_*j to the processors in procs(L_*j)
-     mycols(p) = mycols(p) \ { j }
- while mycols(p) ≠ ∅ do
-   receive any column of L, say L_*k
-   for j ∈ Struct(L_*k) ∩ mycols(p) do
-     cmod(j, k)
-     if column j requires no more cmods do
-       cdiv(j)
-       send L_*j to the processors in procs(L_*j)
-       mycols(p) = mycols(p) \ { j }
- Disadvantages:
- Both L_*k and Struct(L_*k) have to be sent
- Communication is not localized
59Parallel Numerical Factorization Sub column
Cholesky
Tcol(j) is partitioned into various subtasks Tcol(j,1), ..., Tcol(j,P), where Tcol(j,p) aggregates into a single update vector every update vector u(j,k) for which k ∈ Struct(L_j*) ∩ mycols(p)
60Definitions
- mycols(p): the set of columns owned by p
- map[k]: the processor containing column k
- procs(L_j*) = { map[k] : k ∈ Struct(L_j*) }
- u(j, k): scaled column accumulated into the factor column by cmod(j, k)
61Parallel Sub column Cholesky
- for j = 1 to n do
-   if j ∈ mycols(p) or Struct(L_j*) ∩ mycols(p) ≠ ∅ do
-     u = 0
-     for k ∈ Struct(L_j*) ∩ mycols(p) do
-       u = u + u(j,k)
-     if map[j] ≠ p do
-       send u to processor q = map[j]
-     else
-       incorporate u into the factor column j
-       while any aggregated update column for column j remains unreceived do
-         receive in u another aggregated update column for column j
-         incorporate u into the factor column j
-       cdiv(j)
Has uniform and lower communication than the submatrix version. The difference is due to the access patterns of Struct(L_j*) (row structure) versus Struct(L_*k) (column structure).
62A refined version compute-ahead fan-in
- The previous version can lead to processor idling while waiting for the aggregates needed to update column j
- Updating column j can be interleaved with compute-ahead tasks:
- Aggregate u(i, k) for i > j for each completed column k in Struct(L_i*) ∩ mycols(p)
- Receive an aggregated update column for some i > j and incorporate it into factor column i
63Triangular Solve: Parallel Forward and Back Substitution (Anshul Gupta, Vipin Kumar, Supercomputing 95)
64Forward Substitution
- Computation starts with the leaf supernodes of the elimination tree
- The portion of L corresponding to a supernode is a dense trapezoid of width t and height n
- t: the number of nodes/columns in the supernode
- n: the number of non-zeros in the leftmost column of the supernode
65Forward Substitution - Example
66Steps at Supernode
- Initial processing: a vector rhs of size n is formed.
- The 1st t elements correspond to the elements of the RHS vector with the same indices as the nodes of the supernode.
- The remaining n-t elements are filled with 0s.
- Computation (see the sketch after this list):
- Solve the dense triangular system at the top of the trapezoid in the supernode.
- Form the updates corresponding to the remaining n-t rows of the supernode:
- Vector x = product of (the bottom (n-t) x t submatrix of L, the vector of size t containing the solutions from step 1)
- Subtract x from the bottom n-t elements of rhs
- Add the bottom n-t elements of rhs to the corresponding (same index) entries of rhs at the parent supernode
- Step 2.1 at any supernode can begin only after the contributions from all its children have arrived
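A dense sketch of the per-supernode computation, assuming the supernode's trapezoidal block of L is stored as a dense n x t array `panel` (the top t x t lower triangle plus the bottom (n-t) x t rectangle) and `rhs` is the length-n vector described above; these names and the storage layout are assumptions for illustration, not the paper's data structures:

import numpy as np
from scipy.linalg import solve_triangular

def supernode_forward(panel, rhs):
    """Returns (y, update): the solutions for the supernode's t columns and
    the length-(n - t) contribution to be added into the parent's rhs."""
    n, t = panel.shape
    # step 1: solve the dense t x t triangular system at the top of the trapezoid
    y = solve_triangular(panel[:t, :], rhs[:t], lower=True)
    # step 2: update the remaining n - t entries: rhs_bottom -= L_bottom @ y
    update = rhs[t:] - panel[t:, :] @ y
    # step 3 (done by the caller): add `update` into the matching entries
    # of the parent supernode's rhs
    return y, update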
67Parallelization
- For levels > log P, the above steps are performed sequentially on a single processor
- For a supernode at level 0 < l < log P, the above computation steps are performed in parallel on p/2^l processors
- A pipelined or wavefront algorithm is used
68Partitioning
- Assuming unlimited parallelism:
- At a single time step, only t processors are used.
- At a single time step, only one block per row and one block per column are active
- Might as well use 1-D block cyclic.
691-D block cyclic along rows
70Redistribution
- In this scheme, the conclusion is that scalability is achieved with a 1-D block-cyclic distribution
- But the numerical factorization uses a 2-D block-cyclic distribution
- Hence redistribution has to be performed
- Claim: the redistribution cost is of the same order as the triangular solve
71Sparse Iterative Methods
72Iterative vs. Direct Methods: Pros and Cons
- Iterative methods do not give exact results
- Convergence cannot be predicted
- But there is absolutely no fill
73Parallel Jacobi, Gauss-Seidel, SOR
- For problems with grid structure (1-D, 2-D, etc.), Jacobi is easily parallelizable
- Gauss-Seidel and SOR need the most recent values, hence an ordering of updates and sequencing among processors
- But Gauss-Seidel and SOR can be parallelized using a red-black (checkerboard) ordering
742D Grid example
[Figure: a 4x4 grid with nodes numbered 1-16 in natural row-by-row order]
75Red-Black Ordering
- Color alternate nodes in each dimension red and black
- All red nodes can be updated simultaneously, followed by simultaneous updates of all black nodes
- In general, reordering can affect convergence
762D Grid example Red Black Ordering
[Figure: the 4x4 grid renumbered with red-black ordering]
77Multi-Color orderings
- In general, multi-color orderings are used for an arbitrary graph
- Reordering can reduce the convergence rate but can provide more parallelism
- Need to strike a balance
- Multi-color orderings can also be used for pre-conditioned CG
78Pre-conditioned CG
- Instead of solving Ax = b
- Solve A'x' = b' where
- A' = C^-1 A C^-1,
- x' = Cx,
- b' = C^-1 b
- to improve convergence
- M = C^2 is called the pre-conditioner
79Incomplete Cholesky Preconditioner
- M = HH^T where H is the incomplete Cholesky factor of A
- One way of doing incomplete Cholesky: set h_ij = 0 whenever a_ij = 0
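A minimal dense sketch of this zero-fill variant (often called IC(0)): run the usual Cholesky recurrence but update only positions that are already nonzero, so H keeps the sparsity pattern of tril(A). It assumes the incomplete factorization exists (e.g., A is an M-matrix) and uses the current values as a proxy for the pattern; a real code stores the pattern explicitly:

import numpy as np

def incomplete_cholesky_0(A):
    """No-fill incomplete Cholesky: H has the sparsity pattern of tril(A)."""
    n = A.shape[0]
    H = np.tril(A).astype(float)
    for k in range(n):
        H[k, k] = np.sqrt(H[k, k])
        for i in range(k + 1, n):
            if H[i, k] != 0.0:
                H[i, k] /= H[k, k]
        for j in range(k + 1, n):
            for i in range(j, n):
                if H[i, j] != 0.0:             # update only existing nonzeros
                    H[i, j] -= H[i, k] * H[j, k]
    return H   # M = H H^T is the preconditioner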
80Pre-Conditioned CG
- k = 0
- r_0 = b - A x_0
- while r_k ≠ 0
-   Solve M z_k = r_k (2 triangular solves; their parallelization is not straightforward)
-   k = k + 1
-   if k = 1
-     p_1 = z_0
-   else
-     β_k = r_(k-1)^T z_(k-1) / r_(k-2)^T z_(k-2)
-     p_k = z_(k-1) + β_k p_(k-1)
-   end
-   α_k = r_(k-1)^T z_(k-1) / p_k^T A p_k
-   x_k = x_(k-1) + α_k p_k
-   r_k = r_(k-1) - α_k A p_k
- end
- x = x_k
81Graph Coloring
- Graph-colored ordering is used for parallel computation of Gauss-Seidel and for applying incomplete Cholesky preconditioners
- It was shown (Schreiber and Tang) that the minimum number of parallel steps in a triangular solve is given by the chromatic number of the symmetric graph
- Thus, a permutation matrix P is built based on the graph-coloring ordering
- Incomplete Cholesky is applied to PAP^T
- Unknowns corresponding to nodes of the same color are solved in parallel; computation proceeds in steps
82Parallel Triangular Solve based on Multi-Coloring
- Triangular solve Ly = b (2 steps):
- b_w = b_w - L_wv y_v (corresponds to traversing the edge <v, w>)
- y_w = b_w / L_ww (corresponds to visiting vertex w)
- The steps can be done in parallel for all v with the same color
- Thus the parallel triangular solve proceeds in a number of steps equal to the number of colors
[Figure: a 10-vertex example; each node is labeled with its (new order, original order) pair under the multi-color ordering]
83Graph Coloring Problem
- Given G(A) = (V, E)
- σ: V → { 1, 2, ..., s } is an s-coloring of G if σ(i) ≠ σ(j) for every edge (i, j) in E
- The minimum possible value of s is the chromatic number of G
- The graph coloring problem is to color the nodes with the chromatic number of colors
- NP-complete problem
84Heuristics Greedy Heuristic
- 1. Compute a vertex ordering v1, ..., vn of V
- 2. For i = 1 to n, set σ(vi) equal to the smallest available consistent color
- How to do step 1?
[Figure: a 5-vertex example colored under two different orderings; one ordering is non-optimal and leads to more colors, hence step 1 is important]
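A sketch of the greedy heuristic on an adjacency-set graph; the quality of the coloring depends entirely on the vertex ordering chosen in step 1 (the path example below is an illustrative stand-in for the slide's figure):

def greedy_coloring(order, adj):
    """Assign each vertex the smallest color not used by its already-colored neighbors."""
    color = {}
    for v in order:
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# path a-b-c-d: a good ordering uses 2 colors, a bad one uses 3
path = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'c'}}
print(greedy_coloring(['a', 'b', 'c', 'd'], path))   # 2 colors
print(greedy_coloring(['a', 'd', 'c', 'b'], path))   # 3 colors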
85Heuristics Saturation Degree Ordering
- Let v1, ..., v(i-1) have been chosen
- Choose vi such that vi is adjacent to the maximum number of different colors in v1, ..., v(i-1)
86Parallel graph Coloring General algorithm
87Parallel Graph Coloring Finding Maximal
Independent Sets Luby (1986)
- I = ∅
- V' = V
- G' = G
- while G' ≠ ∅ do
-   choose an independent set I' in G'
-   I = I ∪ I'; X = I' ∪ N(I') (N(I') = the vertices adjacent to I')
-   V' = V' \ X; G' = G(V')
- end
- For choosing the independent set I' (Monte Carlo heuristic):
- For each vertex v in V', determine a distinct random number p(v)
- v ∈ I' iff p(v) > p(w) for every w in adj(v)
- Color each MIS a different color
- Disadvantage:
- Each new choice of random numbers requires a global synchronization of the processors.
88Parallel Graph Coloring Enhancement by Jones
and Plassmann (1993)
- Partition the graph G into p partitions for p processors using some graph partitioning algorithm
- V_S: the vertices incident on separator edges
- V_L = V \ V_S
- V_i^S = V_i ∩ V_S for processor i
- V_i^L = V_i ∩ V_L for processor i
- Algorithm:
- Color G(V_S) using the asynchronous Monte Carlo heuristic
- On processor i, color G(V_i^L) given the colors of V_i^S using a sequential heuristic
89Parallel Graph Coloring Gebremedhin and Manne
(2003)
Pseudo-Coloring
90References in Graph Coloring
- M. Luby. A simple parallel algorithm for the maximal independent set problem. SIAM Journal on Computing, 15(4):1036-1054, 1986.
- M.T. Jones, P.E. Plassmann. A parallel graph coloring heuristic. SIAM Journal on Scientific Computing, 14(3):654-669, May 1993.
- L.V. Kale, B.H. Richards, T.D. Allen. Efficient Parallel Graph Coloring with Prioritization. Lecture Notes in Computer Science, vol. 1068, August 1995, pp. 190-208. Springer-Verlag.
- A.H. Gebremedhin, F. Manne. Scalable parallel graph coloring algorithms. Concurrency: Practice and Experience, 12 (2000) 1131-1146.
- A.H. Gebremedhin, I.G. Lassous, J. Gustedt, J.A. Telle. Graph coloring on coarse grained multicomputers. Discrete Applied Mathematics, v. 131, n. 1, pp. 179-198, 6 September 2003.
91References
- M.T. Heath, E. Ng, B.W. Peyton. Parallel
Algorithms for Sparse Linear Systems. SIAM
Review. Vol. 33, No. 3, pp. 420-460, September
1991. - A. George, J.W.H. Liu. The Evolution of the
Minimum Degree Ordering Algorithm. SIAM Review.
Vol. 31, No. 1, pp. 1-19, March 1989. - J. W. H. Liu. Reordering sparse matrices for
parallel elimination. Parallel Computing 11
(1989) 73-91
92References
- Anshul Gupta, Vipin Kumar. Parallel algorithms
for forward and back substitution in direct
solution of sparse linear systems. Conference on
High Performance Networking and Computing.
Proceedings of the 1995 ACM/IEEE conference on
Supercomputing (CDROM). - P. Raghavan. Efficient Parallel Triangular
Solution Using Selective Inversion. Parallel
Processing Letters, Vol. 8, No. 1, pp. 29-40,
1998
93References
- Joseph W. H. Liu. The Multifrontal Method for
Sparse Matrix Factorization. SIAM Review. Vol.
34, No. 1, pp. 82-109, March 1992. - Gupta, Karypis and Kumar. Highly Scalable
Parallel Algorithms for Sparse Matrix
Factorization. TPDS. 1997.