Title: Parallel Algorithms for Unstructured and Dynamically Varying Problems
1Parallel Algorithms for Unstructured and
Dynamically Varying Problems
- Background and Related Work
- motivation (appl., model, performance)
- steps towards best solution
- problem classification
- discuss various approaches
- Mapping Algorithms onto Architectures
- grid-oriented problems (main issues)
- structured grid-oriented problems
- unstructured grid-oriented problems
- Conclusions and Future Work
2Motivation
- Unstructured and dynamic problems
- abstractions of various phenomena
- astrophysics, molec. biology, chemistry
- fluid dynamics, electromagnetics,
- large, computationally intensive
- Applications
- solutions to boundary integral equations
- wave propagation, fluid flow
- transitions from 3-D to 2-D
- prediction (Schrodinger 1st, 2nd)
- local density approximations
- large least-squares problems
- image proc., molec. biology, astrophysics
3Motivation (continued)
- Goal model real event
- simulations, prediction, impact
- accurate solutions
- Need high performance solutions
- performance evaluations
- hard to compare (differ. Parameters)
- metrics (ExT, Sp, E, IsoE, ConvR)
- simple, accurate, fast, effective
4Performance of Parallel Algorithms
- Theoretical order analysis (not enough)
- best solutions restricted applications
- nonrealistic assumptions about resources
- PE number, speed problem size
- Empirical testing (optimal solution diff)
- need to consider various factors
- FlattKen89 given overhead, problem size,
parallel execution time minimum for a unique
number of processors.
Studies conclude converg rate and par.
efficiency det granularity and choice of alg. - BarrHick93 graph (best solution vs time)
- Performance degradation (loss degree)
5Steps towards Best Performance Solution
- Depends upon the best choice of
- parallel algorithm (simple, accurate, fast)
- parallel architecture (fast, effective)
- mapping (detect, partitioning, scheduling)
- Mapping -- goals (minimize factors)
- computation time (numerical efficiency)
- communication (comm/comp ratio)
- load imbalance (effective use resources)
- overhead (synch, comm, sched)
- identify stochastic behavior -- algorithm
design - Best performance (choice of parameters)
- find dominant component
- Mapping -- considerations
- problems domain distrib (pattern, density, size)
- topology interconn net, char each comp.
6Problem Classification
- Pattern of data points distribution
- structured (uniform)
- synchronous
- unstructured (nonuniform)
- loosely synchronous
- asynchronous
- Density of data points
- dense (high density)
- sparse (low density)
- Solutions grids (math description, data struct.)
structured, unstructured, semistructured
7Problem ClassificationDepending on the pattern
of data points distribution
Regularity
Structured
Unstructured
Distribution
space
time
uniform
regular
regular
synchronous
--
Loosely synchronous
irregular
regular
--
nonuniform
irregular
irregular
asynchronous
--
8Problem ClassificationDepending on the density
of data points distribution
Density
Structured
Unstructured
high
dense
dense
low
(sparse)
sparse
9Structured Problems
- synchronous, reg space and time
- uniform distribution data points
- parall easy detect, express, implement
- naturally expressed (vector, matrix)
- compiler mapp constructs, operations
- eg. QCD simulations, chemistry, ...
10Unstructured Problems
- loosly synchronous (irr space, reg time)
- asynchronous (irregular space and time)
- nonuniform distribution data points
- irregularities difficult detect, express
- irregularities hard to implement (develop)
- need flexible HW, fast communication
11Loosely Synchronous Problems
- Dominant methodology irregular sci simul
- irreg static, data par. Over sparse struct
- irreg not too hard detect, express
- hierarchical data structure, sparse matrices
- success express irreg performance gains
- irreg hard to implement
- high-level data struct, geometric decomp
- run better on MIMD than on SIMD
- different data points, distinct alg.
- Eg periodic interact heterogenious objects
- time-driven simul, stat phys, particle dyn
- adaptive meshes, biology, image processing
- Monte Carlo, clustering alg N-body problem
- spacial structure may change dynamically
- need synchronization each iteration setp
- need adaptive algorithms
12Asynchronous
- irreg hard detect, express, implement
- hard parallelize (unless embarassingly)
- Eg event-driven simul, chess, market analys
- irregularities dynamic (cannot exploit)
- cannot use simple mappings (communication,
decomposition) - object-oriented approach (flex commun)
- statistical methods load balancing
13Grid-Oriented Problems
- Structured grids
- simple (each node same procedure)
- low overhead
- hard to create (complex domains)
- Unstructured grids
- easy create, adapt, effective
- no need propagate local features
- large overhead, more storage
- difficult solution
- convergence rate may be slow
- Semistructured grids
- domain unstructured union of structured
subdomains - use GroppKeyes92, PAFMA LeaBoa92
14 15Dense and Sparse Problems
- Structured, Unstructured
- various approaches
16Dense Problems
- Matrix problems
- well or not well determined solutions
- indices not necessarily linear order
- Vandermonde, Toeplitz, orthog structure
- well conditioned accurate regardless computation
method - ill-conditioned can be highly accurate if alg.
computes small residuals, sol. det right hand
side - role ill-cond and cond estimators
- Edel93 paradox
- improved multiply, transpose, inverse
17Sparse Problems
- undirected weighted graphs
- compressed rows
- vector rows, nested pairs column-value
- different multiply, transpose, inverse
- det eigenval, eigenvect sparse matrix
- using O(n) matrix-vector multiply Yau93
18- Figures about performance of PDE sparse solvers
on hypercubes
19Dense Approaches to Sparse Problems
- Sparse problems contain dense blocks
- extract, process regular structure
eg. FEBA alg for sparse
matrix-vector multiplication Agar92 - direct matrix factorization
- decomposition smaller dense, loss performance,
communic router Kratzer92, mapp QR different
that mapp Cholesky
20Sparse Approaches to Dense Problems
- Recently successful, seem counterintuitive
- General methods Edel93, Freu92Reev93
- access matrix through matrix-vector mul.
- Look for preconditioners
- replace O( ) matrix-vector multiplicat. With
approximations (multipole, multigrid) - PAFMA -- LeaBoa92
- nonuniform problem divided uniform reg.
- take advantage regular commun patterns
- processor assignment in subregions with density
variance. When (1) load imbalance assign dense
regions (2)
communication assign sparse regions
21Stochastic Nature of Parallel Algorithms
- Variability run-time behavior alg.
- Solution unpredictable, multiple paths
- path nondeterministic, optimal path chosen
run-time, results divergent number pivots, sol
time, alternate optima - Race conditions (time dependent decisions)
- in algorithms design
- same problem, strategy, operating cond
- alternate optima, different timing event,
incoming variables, choices, sequence of points
traversed - Eg. self-scheduled nonuniform problems
- highly efficient machine utilization
22Stochastic Nature of Parallel Algorithms
(continued)
- Some examples
- parallel network simplex, good load balance,
variability run-time behavior - branch-and-bound for integer programm
- good bounds affects portion of search tree
explored - loops without dependencies among their iteration
but sensitive other iter, OS, appl - Factoring HumSch, Fly92, scheduling good load
balance, overhead, scalable, resistant iteration
variance
23Mapping Algorithms onto Architectures
- Parallelism detection
- depends upon algorithm and problem nature
- independent of architecture
- study data dependency explicit, implicit
- partitioning (problem decomposition)
- task into processes, identify sharing objects
- allocation (distribution task to processors)
- influenced by memory org, interconnect
- scheduling (ordering tasks on processors)
- depends upon interconnect and PE charact
24Mapping Goals(partitioning, scheduling)
- minimize communication exploit locality
- minimize load imbalance adaptive refinement
25Partitioning
- Granularity of process coarse enough for target
machine without loosing parallelism - partitioning technique Sarkar89
- Start initial fine granularity use heuristics to
merge processes until coarsest partition reached
use cost function (depends upon critical path,
overhead)
26Scheduling
- Goal spread load evenly on PEs (efficiency)
- static vs dynamic allocation
- low overhead, inflexible vs high overhead,
flexible - centralized vs distributed scheduling
- centralized mediation SmithSchnabel92 mix
- hierarchical mediation strategy
- Note compiler, automatic tools for partitioning,
scheduling - run-time (flexible, large overhead)
- compile time (low overhead, need technology, good
if easy estimate)
27The Mapping Problem
- Assigning tasks to PEs s.t. minimal
- Model HammondSchreiber92
- given work each PE equal G ( , ) task
graph (vertex process, edges interprocess
commun) H ( , ) PE graph dshortest
path. - Find surjection ? ? s.t.
communication load minimized ?
- good results need partition, alloc, schedul (if
isolated poor mappings, non-optimal time)
28The Mapping Problem(continued)
- Mapping of communication pattern strategy
(efficient) Guptaschenfeld93 - switching locality (sparse nature of
communication graphs) - each process switches its communication between a
small set of other process -- ICN PEs grouped in
small clusters intercluster and intracluster
connectivity - identify this partitioning problem with bounded l
-- contraction of graph RamKri9 (partitioning
of vertex set into subsets such that no subset
contains more than l vertices and every subset
has at least as many vertices as the number of
subsets is connected to) - simulated annealing good results
29 30Mapping of Grid-Oriented Problems
- grid points geometrically adjacent
- partitioning grid into subdomains (subgrids)
- assigning them to PEs each PE perform
computation associated with subdomain - dependency communication restricted to
perimeters of subdomains - Model RooseDries93 -- structured grid
- time integration of finite-difference, finite
volume discretization PDE on structured grid - 2D struct grid partitioned in subdomains of equal
size - each PE performs locally updates for interior
grid points - boundary grid points need neighbor information
31- Figures about grid partitioning and communication
requirements
32Analysis (communication, load balance, numerical
efficiency)
- Communication overhead depends upon
- size subdomains (large subdomains, small
overhead) 2D perim to surf, 3D surf to vol ratio - communic vol proportional to perim subdomain
- for fixed-size subdomain perim min if square
- blockwise partitioning in square subregions leads
to min communication volume - communication requirements not always isotropic
- machine characterisitcs
- influence communication to computation ration
- problem characteristics
- dense problems load imbalance predominates
- sparse problems communication predominates
33Analysis (continued)
- Load Imbalance
- minimal, when block partitioning in square
subdomains - work per grid may vary (cannot predict load
imbalance) - math models differ in various parts of the domain
- boundary regions Work differs from interior
cells cells distributed almost equal among
processors, achieved when subdomains square
34Analysis (continued)
- Numerical Efficiency
- accuracy results differ (depending algorithm,
problem nature) - Numerical properties of inherently parallel
algorithms not affected by partitioning strategy
(e.g. Jacobi relaxation as opposed to
Gauss-Seidel which has better convergence
properties) - Runge-Kutta update overlap regions only after
complete integration step (omitting some
communication) high speedup, small convergence
degradation blocks determined by PEs and not
domain geometry - Block tridiagonal systems (Thomas sequential,
pipelined) parallel solvers use Gaussian
elimination, cyclic reduction distribution of
sequential parts over PEs at the expense of
increased communication
35Analysis (continued)
- Numerical Efficiency (continued)
- domain decomposition algorithm for PDEs (contain
algorithmic overhead) - Schwartz domain decomposition overlapping
subdomains (iterative process, approx boundaries) - Schur Complement non-overlapping subdomains
(borders computed first)
36Hierarchical Nature of Multigrid Algorithms
- Obtaining acceptable performance levels require
optimization techniques which address the
characteristics of each architectural class
Matheson Tarjan93 - Model study for each architectural class
- Conclusions
- fine grain machines (high variable communication
cost) - optimize domain to PE topology mapping
- medium grain machines (high fixed communication
cost) - optimize domain partition (well-shaped, perim
minim, neighbors small) - Parallel algorithms require accurate subspace
decomposition and large PEs to provide practical
alternatives to standard algorithms.
37Optimal Partitioning -- structured grids
- Structured problems (use structured grids)
- unstructured problems (structured on subdomains)
- Regular comput 2D mesh MIMD Lee90
- workload and communic pattern same at each point
- communication depends upon total communic
required, actual pattern of communic ( communic
neighbors, underlying architecture) - CCR important factor in performance evaluation
(given stencil, generate best shape) - CCR depends upon stencil, partition shape, gird
(e.g. diamond -5p, hexagon -7p, square -9p star)
(CCR max does not guarantee minim exec time CCR
better performance indicator in shared-memory
than message passing CCR proportional to aspect
ratio of partition)
38Optimal Partitioning -- structured grids
(continued)
- Optimal partitionings ReedAdamsPatrick87
- formalize relationship stencil, shape,
underlying architecture - isolated evaluation of components yields
suboptimal performance - message-passing small vs large packets have
opposite order results (stencil, shape) - type of interconnect net important (grid to net
mapping must support interpartition communic
pattern -- otherwise performance degradation)
39 40Mapping Algorithms for Unstructured Grid-Oriented
Problems
- Grids structured, unstructured static,
adaptive single level, multilevel (multigrid) - Partitioning
- static grid size work fixed during comput
- dynamic grid size work change
- iterative static (quasi-dynamic) uses adaptive
refinement (grid fixed for each iteration load
balancing after each grid refinement overhead) - Mapping unstructured grids (complex)
- patition grid into subgrids if communication
- low, use fine grids (subgrids PEs)
- high, use coarse grids (subgrids gtgt PEs)
41- Figures about Element Airfoil
42Requirements for Mapping
- Subgrids dependency topology to match machine
communic topology - easy for fully connected machine in general
nearest-neighbor fast, decreases risk of link
contention) - subgrids mutualy dependent mapped onto PEs that
communicate easy (fast, close) - time for local comput depend upon grid points
(if each point same amount work, then each
subgrid same points) - time for interface operations depend on points
on the boundary of subgrid (length of boundary
and adjacent subgrids should be minimized)
43Classification of Partitioning algorithms for
Unstructured Grids
- Optimization techniques (cost function)
- heuristic techniques
- clustering techniques
- graph partitioning
- geometry based techniques
44Optimization techniques
- Minimize cost function -- mapping comunic, load
imbalance - Global combinatorial optimization problem
- grid n points mapped onto p processors
- search space cardinality (impractical)
- total enumeration, symmetry, branch-and-bound
(not practical enough) - Suboptimal solutions
- simulated annealing (expensive) Kirckpatrick83
- genetic algorithms (robust, large amount time)
45Heuristic Techniques
- Simplify model (emphasize critical aspects)
- do not minimize comunic load imbalance
simultaneously (model one term explicitly while
the other is implicitly guiding the search) - e.g. explicitly minimize communication while
implicitly providing equally sized subgrids
46Clustering
- Form clusters with high intra-cluster
dependencies low inter-cluster communication - nearest-neighbor mapping algorithm Sadayappan,
et al.90 - group grid points into clusters
- assign clusters to PEs such that nearest-neighbor
points are reassigned to same or nearest PE - explicit minimize load imbalance (boundary
refinement) - implicit minimize commnic by search strategy.
- bandwidth reduction algorithms Farhat92
- reduce local bandwidth at expense of bad aspect
ratio (sparse matrix technique of reordering the
equations and unknowns)
47Graph Partitioning
- Grid-oriented problems described by dependency
graphs - Grid partitioning problems formulated as graph
partitioning problems - (each vertex correspondes grid point weight
vertex weight piont weight edge communic
vol) - given G (V, E) undirected graph vertices
(equal weights) edges edge
weights p positive integers n1,,np s.t.
- Partition vertex set V into p disjoing subsets
n1,,np (or n1 np) s.t. sum of weights of
edges connecting different subsets in minimal. - NP complete (heuristics, near-optimal solutions)
- Kernighan-Lin algorithms (graph with equal vertex
weights partitioned into 2 more subgraphs) - parallelized graph partitioning algorithm for
message passing MIMD Gilber, et al.87
48Spectral Bisection Algorithm
- Based upon results from spectral graph theory in
which eigenvectors of a matrix used to bisect
graph - applied to interdependency graph of a grid
- initial algorithm Pothen, et al.90
- graph bisected using eigenvector corresp to 2-nd
smallest eigenvalue of Laplacian matrix of graph - algorithm for weighted vertices edges
HendricksonLeland93 - graph partitioned into eight subgraphs at once
applied recursively to subgraphs - improved alg various edge weights RoooseDri93
- better partitions for dynamic unstructured grids
with good load balance - reduces remapping elements by extending initial
graph with virtual vertices
49- Figures about RSB algorithms
50Geometry Based Techniques
- Partitioning based on geometrical info
- geometrical info about grid points
- grid bisected in subgrids with equal grid points
(applied recursively) - less expensive, produces acceptable partitionings
- spectral bisection alg better for large grid
points and higher accuracy (expensive) - ORB (half bodies n each subspace)
- bisection along x, y, z coordinates
- largest dimension chosen to split
- used by FMA, AFMA
51Geometry Based Techniques (continued)
- Inertial recursive bisection
- partitionings do not depend upon the system of
coordinates used but upon problem characteristics - grid will be bisected orthogonal to the principal
inertia direction. - Principal inertia direction of a set of objects
is the direction for which rotation inertia
momentum is minimal when the direction is taken
as the rotation axis. - expensive but accurate results
52- Figures about partitionings
53Comparison between Spectral Bisection and
Coordinate Bisection
54Optimal Partitioning Techniques
- Static load balancing Berger87
- Binary decomposition (rectangles equal work, easy
data structure) - optimal depth
determined by mapping communication costs onto
different architectures - cardinality (quality of mapping)
edges of
problem graph that fall on edges of processor
graph divided by total edges (cardinality1,
minimum communications) - regular domains are easy (map binary gray code on
gypercube) - irregular domains -- difficult (neighbor regions
not on neighbor nodes)
55Optimal Partitioning Techniques (continued)
- Dynamic load balancing
- Nearest-neighbor (move grad. Towards balance)
- partial workload info, modify domains iteratively
- uses statistical methods for analysis too slow
- Hierarchical master-slave DragonGustafson89
- ideal balance at synchronization points
(hypercube) - divide conquer (time, space, effectiveness)
- all nodes same work within granularity of task
size - each node in subcube knows the extreme values
- if redistribution work needed (subcube with less
work takes its neighbor extreme particles,
subcube with more work removes extreme particle
a new boundary established) - process moves to a finer level
56Scheduling techniques(improving load balancing)
- Factoring (parallel loop iter) Hummel, et al.92
- minim synchronization (coord global scheduling)
- iterates dynamic scheduled decreasing chunks
(sizes decrease exponentially rather than
linearly) - granularity large (self-sched), small
(static-sched) - Tiling (statistical methods) Ramsadayapan90
- minimize communication
- study stencil, partition shape, interconn topol
- Fractiling (factoring tiling) Hummel93
- load balancing (min synchron, min communic)
- chunk-size (factoring rule)
- chunk-shape (tiling)
- exploit properties found in fractals