Title: Domain decomposition in parallel computing
1Domain decomposition in parallel computing
COT 5405 Fall 2006
- Ashok Srinivasan
- www.cs.fsu.edu/asriniva
- Florida State University
2Outline
- Background
- Geometric partitioning
- Graph partitioning
- Static
- Dynamic
- Important points
3Background
- Tasks in a parallel computation need access to certain data
- Same datum may be needed by multiple tasks
- Example: In matrix-vector multiplication, $b_2$ is needed for the computation of every $c_i$, $1 \le i \le n$
- If a process does not own a datum needed by its task, then it has to get it from a process that has it
- This communication is expensive
- Aims of domain decomposition
- Distribute the data in such a manner that the communication required is minimized
- Ensure that the computational loads on processes are balanced
4Domain decomposition example
- Finite difference computation
- New value of a node depends on old values of its neighbors
- We want to divide the nodes amongst the processes so that
- Communication is minimized (this is the measure of partition quality)
- Computational load is evenly balanced
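As a rough illustration of the two criteria above, the following Python sketch compares two ways of dividing an 8 x 8 finite difference grid among 4 processes. The grid size, the nearest-neighbor (5-point) dependence, and the names `edge_cut`, `loads`, `strips`, and `blocks` are assumptions made for this example; they are not from the slides.

```python
def edge_cut(owner, rows, cols):
    """Count grid edges whose two end nodes belong to different processes."""
    cut = 0
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows and owner(r, c) != owner(r + 1, c):
                cut += 1
            if c + 1 < cols and owner(r, c) != owner(r, c + 1):
                cut += 1
    return cut


def loads(owner, rows, cols, nprocs):
    """Number of grid nodes assigned to each process."""
    counts = [0] * nprocs
    for r in range(rows):
        for c in range(cols):
            counts[owner(r, c)] += 1
    return counts


N = 8  # hypothetical grid size


def strips(r, c):
    """Four horizontal strips of two rows each."""
    return r // 2


def blocks(r, c):
    """A 2 x 2 arrangement of 4 x 4 blocks."""
    return 2 * (r // 4) + (c // 4)


print(loads(strips, N, N, 4), edge_cut(strips, N, N))
print(loads(blocks, N, N, 4), edge_cut(blocks, N, N))
```

Both layouts give each process 16 nodes, but the strips cut 24 grid edges while the square blocks cut only 16, so the blocks need less communication for the same load balance.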
5Geometric partitioning
- Partition a set of points
- Uses only coordinate information
- Balances the load
- The heuristic tries to ensure that communication costs are low
- Algorithms are typically fast, but partition not of high quality
- Examples
- Orthogonal recursive bisection
- Inertial
- Space filling curves
6Orthogonal recursive bisection
- Recursively bisect orthogonal to the longest dimension
- Assume communication is proportional to the surface area of the domain, and aligned with coordinate axes
- Recursive bisection
- Divide into two pieces, keeping load balanced
- Apply recursively, until desired number of
partitions obtained
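The recursion described above can be sketched in a few lines of Python. This is a minimal illustration, assuming every point carries the same load and that the longest dimension is measured by the bounding box; the function name `orb` and the proportional split for part counts that are not powers of two are choices made for this example.

```python
def orb(points, nparts):
    """Orthogonal recursive bisection of a list of coordinate tuples.

    Assumes unit load per point (an assumption of this sketch).
    """
    if nparts == 1:
        return [points]
    dims = len(points[0])
    # Find the longest dimension of the bounding box.
    extent = [max(p[d] for p in points) - min(p[d] for p in points)
              for d in range(dims)]
    axis = extent.index(max(extent))
    # Cut orthogonal to that dimension, keeping the load balanced.
    ordered = sorted(points, key=lambda p: p[axis])
    left_parts = nparts // 2
    split = len(ordered) * left_parts // nparts
    return (orb(ordered[:split], left_parts)
            + orb(ordered[split:], nparts - left_parts))


grid = [(x, y) for x in range(8) for y in range(4)]
parts = orb(grid, 4)
print([len(p) for p in parts])
```

For the 8 x 4 grid in the example, the first cut is orthogonal to x because that is the longer side, and each of the four parts receives 8 points.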
7Inertial
- ORB may not be effective if cuts along the x, y, or z directions are not good ones
- Inertial
- Recursively bisect orthogonal to the inertial axis
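A single inertial bisection step might look like the sketch below. It assumes 2D points with unit load and takes the inertial axis to be the principal axis of the point set (the line through the centroid along which the points are most spread out). The closed-form angle and the name `inertial_bisect` are choices made for this example.

```python
import math


def inertial_bisect(points):
    """Split 2D points into two halves by a cut orthogonal to the inertial axis."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Direction of the principal axis of the 2 x 2 scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    # Project each point onto the axis and cut at the median projection.
    ordered = sorted(points, key=lambda p: (p[0] - cx) * ux + (p[1] - cy) * uy)
    half = n // 2
    return ordered[:half], ordered[half:]


diagonal = [(i, i) for i in range(10)]
low, high = inertial_bisect(diagonal)
print(low, high)
```

For points lying along a diagonal, the axis follows the diagonal, so the cut is perpendicular to it rather than aligned with x or y. Applying the function recursively to each half would give more parts.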
8Space filling curves
- Space filling curves
- A continuous curve that fills the space
- Order the points based on their relative position on the curve
- Choose a curve that preserves proximity
- Points that are close in space should be close in the ordering too
- Example
- Hilbert curve
9Hilbert curve
- Sources
- http://www.dcs.napier.ac.uk/andrew/hilbert.html
- http://www.fractalus.com/kerry/tutorials/hilbert/hilbert-tutorial.html
10Domain decomposition with a space filling curve
- Order points based on their position on the curve
- Divide into P parts
- P is the number of processes
- Space filling curves can be used in adaptive computations too
- They can be extended to higher dimensions too
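The two steps above (order the points along the curve, then divide into P parts) can be sketched as follows. The index computation is the widely used bitwise conversion from grid coordinates to a position on a 2D Hilbert curve. The names `hilbert_index` and `sfc_partition`, the integer grid, and the assumption that the grid side `n` is a power of two are all choices made for this example.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n grid.

    n is assumed to be a power of two.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def sfc_partition(points, n, nparts):
    """Order points along the curve and cut the ordering into nparts pieces."""
    ordered = sorted(points, key=lambda p: hilbert_index(n, p[0], p[1]))
    bounds = [len(ordered) * k // nparts for k in range(nparts + 1)]
    return [ordered[bounds[k]:bounds[k + 1]] for k in range(nparts)]


cells = [(x, y) for x in range(4) for y in range(4)]
for part in sfc_partition(cells, 4, 4):
    print(part)
```

On the 4 x 4 grid, each of the four parts is a 2 x 2 quadrant, which shows how an ordering that preserves proximity leads to compact parts.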
11Graph partitioning
- Model as graph partitioning
- Graph G = (V, E)
- Each task is represented by a vertex
- A weight can be used to represent the computational effort
- An edge exists between tasks if one needs data owned by the other
- Weights can be associated with edges too
- Goal
- Partition vertices into P parts such that each partition has equal vertex weights
- Minimize the weights of edges cut
- Problem is NP hard
- Edge cut metric
- Judge the quality of the partitioning by the
number of edges cut
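To make the goal and the edge cut metric concrete, the sketch below evaluates a given partition of a small weighted graph. The dictionary-based graph representation, the function name `partition_quality`, and the example weights are assumptions made for illustration.

```python
def partition_quality(vertex_weight, edge_weight, part):
    """Return (load per part, number of edges cut, total weight of edges cut).

    vertex_weight: dict vertex -> weight
    edge_weight:   dict (u, v) -> weight
    part:          dict vertex -> part id
    """
    load = {}
    for v, w in vertex_weight.items():
        load[part[v]] = load.get(part[v], 0) + w
    cut = [(u, v) for (u, v) in edge_weight if part[u] != part[v]]
    return load, len(cut), sum(edge_weight[e] for e in cut)


# Hypothetical 4-cycle with two heavy and two light edges.
vw = {0: 1, 1: 1, 2: 1, 3: 1}
ew = {(0, 1): 5, (1, 2): 1, (2, 3): 5, (3, 0): 1}
print(partition_quality(vw, ew, {0: 0, 1: 0, 2: 1, 3: 1}))
print(partition_quality(vw, ew, {0: 0, 1: 1, 2: 1, 3: 0}))
```

Both partitions are balanced and cut two edges, but the first cuts a total weight of 2 and the second a total weight of 10, which is why edge weights matter when they are available.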
12Static graph partitioning
- Combinatorial
- Levelized nested dissection
- Kernighan-Lin/Fiduccia-Mattheyses
- Spectral partitioning
- Multi-level methods
13Combinatorial partitioning
- Use only connectivity information
- Examples
- Levelized nested dissection
- Kernighan-Lin/Fiduccia-Mattheyses
14Levelized nested dissection (LND)
- Idea is similar to the geometric methods
- But cannot use coordinate information
- Instead of projecting vertices along the longest axis, order them based on distance from a vertex that may be one extreme of the longest dimension of a graph
- Pseudo-peripheral vertex
- Perform a breadth-first search, starting from an arbitrary vertex
- The vertex that is encountered last might be a good approximation to a peripheral vertex
15LND example: Finding a pseudoperipheral vertex
[Figure: graph in which each vertex is labeled with its breadth-first search level (1 to 4) from the initial vertex; the pseudoperipheral vertex is marked]
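16LND example: Partitioning
[Figure: graph in which each vertex is labeled with its breadth-first search level (1 to 6) from the initial vertex; the resulting partition is marked]
- Recursively bisect the subgraphs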
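One bisection step of levelized nested dissection can be sketched as below. It assumes a connected graph stored as an adjacency dictionary and unit vertex weights. The names `bfs_order` and `lnd_bisect` are chosen for this example, and the half that is reached first in the second search forms one part.

```python
from collections import deque


def bfs_order(adj, start):
    """Vertices in the order a breadth-first search from start visits them."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


def lnd_bisect(adj, start):
    """Bisect a connected graph using distance from a pseudo-peripheral vertex."""
    # The vertex encountered last approximates a peripheral vertex.
    far = bfs_order(adj, start)[-1]
    # Search again from it and take the first half of the vertices reached.
    order = bfs_order(adj, far)
    half = len(order) // 2
    return set(order[:half]), set(order[half:])


# Hypothetical example: a path 0-1-2-3-4-5, starting from an interior vertex.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(lnd_bisect(path, 2))
```

Starting the search at vertex 2 finds vertex 5 as the pseudo-peripheral vertex, and the second search then separates {3, 4, 5} from {0, 1, 2}.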
17Kernighan-Lin/Fiduccia-Mattheyses
- Refines an existing partition
- Kernighan-Lin
- Consider pairs of vertices from different partitions
- Choose a pair whose swapping will result in the best improvement in partition quality
- The best improvement may actually be a worsening
- Perform several passes
- Choose best partition among those encountered
- Fiduccia-Mattheyses
- Similar but more efficient
- Boundary Kernighan-Lin
- Consider only boundary vertices to swap
- ... and many other variants
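A single Kernighan-Lin pass, as outlined above, might be sketched like this. For clarity the sketch recomputes the edge cut for every candidate swap, which is far slower than the gain bookkeeping used in practice. The locking of swapped vertices, the names `cut_size` and `kl_pass`, and the example graph are assumptions made for this illustration.

```python
def cut_size(edges, side):
    """Number of edges whose end points are on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def kl_pass(edges, side):
    """One pass of pairwise swaps; side maps each vertex to 0 or 1.

    Returns the best partition encountered and its edge cut.
    """
    side = dict(side)
    best, best_cut = dict(side), cut_size(edges, side)
    unlocked = set(side)
    while True:
        pairs = [(a, b)
                 for a in sorted(unlocked) if side[a] == 0
                 for b in sorted(unlocked) if side[b] == 1]
        if not pairs:
            break

        def cut_after_swap(pair):
            trial = dict(side)
            trial[pair[0]], trial[pair[1]] = 1, 0
            return cut_size(edges, trial)

        # Take the best swap, even if it makes the cut worse.
        a, b = min(pairs, key=cut_after_swap)
        side[a], side[b] = 1, 0
        unlocked -= {a, b}  # each vertex moves at most once per pass
        cut = cut_size(edges, side)
        if cut < best_cut:
            best, best_cut = dict(side), cut
    return best, best_cut


# Hypothetical example: two triangles joined by edge (2, 3), with 2 and 3 misplaced.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
start = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}
print(cut_size(edges, start), kl_pass(edges, start))
```

The starting partition cuts 5 edges. The first swap exchanges vertices 2 and 3 and reduces the cut to 1; the later swaps in the pass only make it worse, so the partition kept is the one after the first swap.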
18Kernighan-Lin example
[Figure: an existing partition with edge cut 4; swapping the two marked vertices gives a better partition with edge cut 3]
19Spectral method
- Based on the observation that a Fiedler vector of a graph contains connectivity information
- Laplacian of a graph: $L$
- $l_{ii} = d_i$ (degree of vertex $i$)
- $l_{ij} = -1$ if edge $(i, j)$ exists, otherwise 0, for $i \ne j$
- Smallest eigenvalue of $L$ is 0, with eigenvector all 1
- All other eigenvalues are positive for a connected graph
- Fiedler vector
- Eigenvector corresponding to the second smallest
eigenvalue
20Fiedler vector
- Consider a partitioning of V into A and B
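- Let $y_i = 1$ if $v_i \in A$, and $y_i = -1$ if $v_i \in B$
- For load balance, $\sum_i y_i = 0$
- Also $\sum_{e_{ij} \in E} (y_i - y_j)^2 = 4 \times$ number of edges across partitions
- Also, $y^T L y = \sum_i d_i y_i^2 - 2 \sum_{e_{ij} \in E} y_i y_j = \sum_{e_{ij} \in E} (y_i - y_j)^2$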
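The identity above can be checked numerically on a small graph. The sketch below builds the Laplacian of a 4-cycle (a hypothetical example) and compares $y^T L y$ with four times the number of cut edges; the helper names `laplacian` and `quadratic_form` are chosen for this illustration.

```python
def laplacian(n, edges):
    """Dense graph Laplacian: degrees on the diagonal, -1 for each edge."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L


def quadratic_form(L, y):
    """Compute y^T L y."""
    n = len(y)
    return sum(y[i] * L[i][j] * y[j] for i in range(n) for j in range(n))


cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
y = [1, 1, -1, -1]  # A = {0, 1}, B = {2, 3}
edges_cut = sum(1 for i, j in cycle if y[i] != y[j])
print(quadratic_form(laplacian(4, cycle), y), 4 * edges_cut)
```

Here two edges cross the partition, and both expressions evaluate to 8.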
21Optimization problem
- The optimal partition is obtained by solving
- Minimize $y^T L y$
- Constraints
- $y_i \in \{-1, 1\}$
- $\sum_i y_i = 0$
- This is NP hard
- Relaxed problem
- Minimize $y^T L y$
- Constraints
- $\sum_i y_i = 0$
- Add a constraint on a norm of $y$, for example, $\|y\|_2 = n^{0.5}$
- Note
- $(1, 1, ..., 1)^T$ is an eigenvector with eigenvalue 0
- For a connected graph, all other eigenvalues are positive and their eigenvectors are orthogonal to this eigenvector, which implies $\sum_i y_i = 0$
- The objective function is minimized by a Fiedler vector
22Spectral algorithm
- Find a Fiedler vector of the Laplacian of the graph
- Note that the Fiedler value (the second smallest eigenvalue) yields a lower bound on the communication cost, when the load is balanced
- From the Fiedler vector, bisect the graph
- Let all vertices with components in the Fiedler vector greater than the median be in one component, and the rest in the other
- Recursively apply this to each partition
- Note: Finding the Fiedler vector of a large graph can be time consuming
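The bisection step above can be sketched with NumPy's dense symmetric eigensolver. This is only practical for small graphs; large graphs would need a sparse iterative eigensolver. The function name `spectral_bisect` and the example graph are assumptions made for this illustration.

```python
import numpy as np


def spectral_bisect(n, edges):
    """Bisect a connected graph using the Fiedler vector of its Laplacian."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1
        L[j, j] += 1
        L[i, j] -= 1
        L[j, i] -= 1
    # eigh returns eigenvalues in ascending order, so column 1 is a Fiedler vector.
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    fiedler = eigenvectors[:, 1]
    median = np.median(fiedler)
    part_a = {i for i in range(n) if fiedler[i] > median}
    part_b = set(range(n)) - part_a
    return part_a, part_b, eigenvalues[1]


# Hypothetical example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
part_a, part_b, fiedler_value = spectral_bisect(6, edges)
print(part_a, part_b, fiedler_value)
```

The sign of an eigenvector is arbitrary, so which triangle ends up in `part_a` may vary, but the two triangles are separated and only the joining edge is cut. Ties at the median can leave the two parts unequal in size; a production code would need a rule for breaking them.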
23Multilevel methods
- Idea
- It takes time to partition a large graph
- So partition a small graph instead!
- Three phases
- Graph coarsening
- Combine vertices to create a smaller graph
- Example: Find a suitable matching
- Apply this recursively until a suitably small graph is obtained
- Partitioning
- Use spectral or another partitioning algorithm to partition the small graph
- Multilevel refinement
- Uncoarsen the graph to get a partitioning of the original graph
- At each level, perform some graph refinement
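One level of coarsening and the matching projection step can be sketched as follows. The slides only say to find a suitable matching; visiting edges from heaviest to lightest (heavy-edge matching) is an assumption of this sketch, as are the names `coarsen` and `project` and the dictionary-based graph representation.

```python
def coarsen(vertex_weight, edge_weight):
    """Merge matched vertex pairs into single coarse vertices.

    Returns (coarse vertex weights, coarse edge weights, fine -> coarse map).
    """
    coarse_of = {}
    next_id = 0
    # Greedy matching, heaviest edges first (an assumption of this sketch).
    for (u, v), _ in sorted(edge_weight.items(), key=lambda kv: -kv[1]):
        if u not in coarse_of and v not in coarse_of:
            coarse_of[u] = coarse_of[v] = next_id
            next_id += 1
    for v in vertex_weight:  # unmatched vertices are carried over unchanged
        if v not in coarse_of:
            coarse_of[v] = next_id
            next_id += 1
    coarse_vw = {}
    for v, w in vertex_weight.items():
        coarse_vw[coarse_of[v]] = coarse_vw.get(coarse_of[v], 0) + w
    coarse_ew = {}
    for (u, v), w in edge_weight.items():
        cu, cv = coarse_of[u], coarse_of[v]
        if cu != cv:  # edges inside a merged pair disappear
            key = (min(cu, cv), max(cu, cv))
            coarse_ew[key] = coarse_ew.get(key, 0) + w
    return coarse_vw, coarse_ew, coarse_of


def project(coarse_part, coarse_of):
    """Uncoarsen: each fine vertex inherits the part of its coarse vertex."""
    return {v: coarse_part[c] for v, c in coarse_of.items()}


# Hypothetical example: a 4-cycle with unit weights.
vw = {0: 1, 1: 1, 2: 1, 3: 1}
ew = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (0, 3): 1}
cvw, cew, mapping = coarsen(vw, ew)
print(cvw, cew, project({0: 0, 1: 1}, mapping))
```

The 4-cycle collapses to two coarse vertices of weight 2 joined by one edge of weight 2, so vertex and edge weights in the coarse graph still reflect the load and edge cut of the original graph.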
24Multilevel example (without refinement)
[Figure, shown over slides 24 to 28: an example graph with vertices numbered 1 to 16; weights of 1 and 2 appear on the later slides]
29Dynamic partitioning
- We have an initial partitioning
- Now, the graph changes
- Determine a good partition, fast
- Also minimize the number of vertices that need to be moved
- Examples
- PLUM
- Jostle
- Diffusion
30PLUM
- Partition based on the initial mesh
- Vertex and edge weights alone changed
- Map partitions to processors
- Use more partitions than processors
- Ensures finer granularity
- Compute a similarity matrix based on data already on a process
- Measures savings on data redistribution cost for each (process, partition) pair
- Choose assignment of partitions to processors
- Example: Maximum weight matching
- Duplicate each processor (number of partitions)/P times
- Alternative: Greedy approximation algorithm
- Assign in order of maximum similarity value
- http://citeseer.nj.nec.com/oliker98plum.html
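The greedy alternative above can be sketched as follows. This is an illustration of assigning in order of maximum similarity value, not PLUM's actual code. The function name `greedy_assign`, the layout of the similarity matrix, and the assumption that the number of processors divides the number of partitions are all choices made for this example.

```python
def greedy_assign(similarity, nprocs):
    """Assign partitions to processors in order of decreasing similarity.

    similarity[p][j] is the amount of partition j's data already on processor p.
    Each processor receives nparts / nprocs partitions (assumed to divide evenly).
    """
    nparts = len(similarity[0])
    capacity = [nparts // nprocs] * nprocs
    entries = sorted(((similarity[p][j], p, j)
                      for p in range(nprocs) for j in range(nparts)),
                     reverse=True)
    owner = {}
    for _, p, j in entries:
        if j not in owner and capacity[p] > 0:
            owner[j] = p
            capacity[p] -= 1
    return owner


# Hypothetical similarity values for 2 processors and 4 partitions.
similarity = [[10, 1, 8, 0],
              [2, 9, 7, 3]]
print(greedy_assign(similarity, 2))
```

In this example partitions 0 and 2 stay on processor 0 and partitions 1 and 3 on processor 1, so most of the data that is already in place does not have to move.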
31JOSTLE
- Use Hu and Blake's scheme for load balancing
- Solve $Lx = b$ using Conjugate Gradient
- $L$ = Laplacian of processor graph, $b_i$ = Weight on process $P_i$ - Average weight
- Move $\max(x_i - x_j, 0)$ weight from $P_i$ to $P_j$
- Leads to balanced load
- Equivalent to $P_i$ sending $x_i$ load to each neighbor $j$, and each neighbor $P_j$ sending $x_j$ to $P_i$
- Net loss in load for $P_i$ = $d_i x_i - \sum_{\text{neighbor } j} x_j = L_{(i)} x = b_i$
- where $L_{(i)}$ is row $i$ of $L$, and $d_i$ is degree of $i$
- New load for $P_i$ = weight on $P_i$ - $b_i$ = average weight
- Leads to minimum L2 norm of load moved
- Using $\max(x_i - x_j, 0)$
- Select vertices to move, based on relative gain
- http://citeseer.nj.nec.com/walshaw97parallel.html
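The load balancing step above can be sketched with a small conjugate gradient solver. This illustrates only the flow computation, not JOSTLE's implementation or its choice of which vertices to move. The names `conjugate_gradient` and `hu_blake_flows`, the dense matrix storage, and the example loads are assumptions made for this sketch.

```python
def conjugate_gradient(L, b, tol=1e-20, max_iter=1000):
    """Solve L x = b for a symmetric positive semidefinite L.

    L is singular for a graph Laplacian, but the system is consistent
    because the entries of b sum to zero.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)
    p = list(r)
    rr = sum(v * v for v in r)
    for _ in range(max_iter):
        if rr < tol:
            break
        Lp = [sum(L[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rr / sum(p[i] * Lp[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Lp[i] for i in range(n)]
        rr_new = sum(v * v for v in r)
        p = [r[i] + (rr_new / rr) * p[i] for i in range(n)]
        rr = rr_new
    return x


def hu_blake_flows(edges, load):
    """Load to move along each processor-graph edge (positive: i sends to j)."""
    n = len(load)
    average = sum(load) / n
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    x = conjugate_gradient(L, [w - average for w in load])
    return {(i, j): x[i] - x[j] for i, j in edges}


# Hypothetical example: three processors in a line with loads 9, 3, and 0.
flows = hu_blake_flows([(0, 1), (1, 2)], [9.0, 3.0, 0.0])
print(flows)
```

For this example the average load is 4, and the computed flows move 5 units from processor 0 to processor 1 and 4 units from processor 1 to processor 2, leaving every processor with 4.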
32Diffusion
- Involves only communication with neighbors
- A simple scheme
- Processor $P_i$ repeatedly sends $\alpha w_i$ weight to each neighbor
- $w_i$ = weight on $P_i$
- $w^k = (I - \alpha L) w^{k-1}$, where $w^k$ = weight vector at iteration $k$
- Simple criteria exist for choosing $\alpha$ to ensure convergence
- Example: $\alpha = 0.5/(\max_i d_i)$
- More sophisticated schemes exist
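The simple scheme above can be sketched directly from the update rule, using the example choice of $\alpha$. The function name `diffuse`, the neighbor-list representation, and the example loads are assumptions made for this illustration.

```python
def diffuse(neighbors, load, steps):
    """Apply w <- (I - alpha L) w for a number of steps.

    neighbors[i] lists the processors adjacent to processor i.
    """
    alpha = 0.5 / max(len(nbrs) for nbrs in neighbors)
    w = list(load)
    for _ in range(steps):
        # Each processor sends alpha * w[i] to every neighbor and
        # receives alpha * w[j] from every neighbor j.
        w = [w[i] + alpha * sum(w[j] - w[i] for j in neighbors[i])
             for i in range(len(w))]
    return w


# Hypothetical example: three processors in a line with loads 9, 3, and 0.
line = [[1], [0, 2], [1]]
print(diffuse(line, [9.0, 3.0, 0.0], 1))
print(diffuse(line, [9.0, 3.0, 0.0], 200))
```

The total load is conserved at every step, and after enough steps each processor in the example holds the average load of 4. Each step needs only communication between neighbors, at the cost of many iterations.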
33Important points
- Goals of domain decomposition
- Balance the load
- Minimize communication
- Space filling curves
- Graph partitioning model
- Spectral method
- Relax NP hard integer optimization to floating point, and then discretize to get approximate integer solution
- Multilevel methods
- Three phases
- Dynamic partitioning: additional requirements
- Use old solution to find new one fast
- Minimize number of vertices moved