Title: Designing an Efficient Partitioning Algorithm for Grid Environments with Application to N-Body Problems
1 Designing an Efficient Partitioning Algorithm for Grid Environments with Application to N-Body Problems
- Daniel J. Harvey
- Department of Computer Science
- Southern Oregon University
- E-mail: harveyd_at_sou.edu
- Sajal K. Das
- Department of Computer Science and Engineering
- The University of Texas at Arlington
- E-mail: das_at_cse.uta.edu
- Rupak Biswas
- NASA Ames Research Center
- E-mail: rbiswas_at_nas.nasa.gov
2 Presentation Overview
- The information power grid (IPG)
- The MinEX partitioner
- This paper's contributions
- Metrics utilized
- The N-Body problem
- MinEX refinements
- Experimental study
- Performance results
- Conclusions and on-going research
3 The Information Power Grid (IPG)
- Harness power of geographically separated resources
- Developed by NASA and other collaborative partners
- Utilize geographically separated processors to solve large-scale computational problems
- Characteristics
  - limited bandwidth and high latency
  - heterogeneous configurations
- Relevant applications identified by I-Way experiment
  - Remote access to large databases requiring high-end graphics
  - Remote virtual reality access to instruments
  - Remote interactions with super-computer simulations
4 Load Balancing Approaches (especially important in grid environments)
- Traditional load-balancing objectives
  - Distribute workload evenly among processors
  - Minimize idle time
- Static load-balancing
  - Balance load prior to execution
  - Examples: smart compilers, schedulers
- Dynamic load-balancing
  - Balance as application is processed
  - Examples: adaptive contracting, gradient, symmetric broadcast networks
- Semi-dynamic load-balancing (our focus in this paper)
  - Temporarily stop application processing to balance workload
  - Utilizes a partitioning technique
  - Examples: MeTiS, Jostle, PLUM
5 The MinEX Partitioner
- We previously introduced a novel partitioner called MinEX
  - "MinEX: A latency-tolerant dynamic partitioner for grid computing applications," FGCS, 18 (2002), pp. 477-489
- MinEX's unique characteristics include
  - Environment: designed specifically for heterogeneous geographically distributed environments
  - Grid: maps configuration graph onto the partition graph; produces partitions reflecting the grid
  - Goal: minimize runtime rather than balance processing workload and minimize edge cut
  - Latency: accounts for latency tolerance during partitioning
  - Accounts for data movement communication overhead
6 This Paper's Contributions
- To compare MinEX performance to METIS, a state-of-the-art partitioner
  - Result: Speed of execution is competitive
  - Result: Quality of partitions reduces application runtime by up to a factor of 6
- Estimate performance utilizing a wide range of heterogeneous grid configurations
- Apply MinEX to a real-life application (the N-Body problem) executing in simulated grid environments
- Introduce refinements to our initial algorithm
7 The MinEX Partitioner
- Multi-level scheme
  - Collapses edges incrementally
  - Partitions the contracted graph
  - Refines the graph in reverse
- Reassignments executed to improve partition quality
- Creates diffusive or from-scratch partitions
- User-supplied function estimates solver latency tolerance
- Accounts for data redistribution cost during partitioning
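8 Metrics Utilized
- Processing weight
  - $\mathrm{Wgt}_v^p = \mathrm{PWgt}_v \times \mathrm{Proc}_c$
- Communication cost
  - $\mathrm{Comm}_v^p = \sum_{w \in p} \mathrm{CWgt}(v,w) \times \mathrm{Connect}(c,d)$
- Redistribution cost
  - $\mathrm{Remap}_v^p = \mathrm{RWgt}_v \times \mathrm{Connect}(c,d)$ if $p \neq q$
- Weighted queue length
  - $\mathrm{QWgt}(p) = \sum_{v \in p} (\mathrm{Wgt}_v^p + \mathrm{Comm}_v^p + \mathrm{Remap}_v^p)$
- Heaviest load (MaxQWgt)
- Queue length
  - $\mathrm{QLen}_p$ = number of vertices $\in p$
- Average load (WSysLL)
- Total system load
  - $\mathrm{QWgtToT} = \sum_{p \in P} \mathrm{QWgt}(p)$
- Imbalance factor
  - $\mathrm{LoadImb} = \mathrm{MaxQWgt} / \mathrm{WSysLL}$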
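The metrics on this slide can be tried out on a toy graph. The following is a minimal sketch, not the MinEX implementation. It assumes that c is the cluster of the processor holding v, that d is the cluster of the processor holding the neighbour w (or, for Remap, the processor v came from), and that WSysLL is the total load divided by the number of processors. The function names and all toy values are hypothetical.

```python
def queue_weights(assign, cluster_of, pwgt, cwgt, rwgt, proc, connect, old_assign=None):
    """Return {processor: QWgt(p)} for one vertex-to-processor assignment.

    assign[v] is the processor holding vertex v, cluster_of[p] its cluster,
    cwgt[(v, w)] the directed edge weight, proc[c] the per-unit processing
    cost of cluster c, connect[c][d] the per-unit link cost between clusters.
    """
    qwgt = {p: 0.0 for p in cluster_of}
    for v, p in assign.items():
        c = cluster_of[p]
        wgt = pwgt[v] * proc[c]                       # Wgt = PWgt_v x Proc_c
        comm = sum(weight * connect[c][cluster_of[assign[w]]]
                   for (u, w), weight in cwgt.items() if u == v)
        remap = 0.0
        if old_assign is not None and old_assign[v] != p:
            d = cluster_of[old_assign[v]]
            remap = rwgt[v] * connect[c][d]           # Remap applies only if v moved
        qwgt[p] += wgt + comm + remap
    return qwgt


def load_imbalance(qwgt):
    max_qwgt = max(qwgt.values())                     # MaxQWgt
    wsysll = sum(qwgt.values()) / len(qwgt)           # WSysLL, assumed QWgtToT / P
    return max_qwgt / wsysll


# Toy data (hypothetical): two processors, one per cluster.
cluster_of = {"p0": 0, "p1": 1}
proc = {0: 1.0, 1: 2.0}
connect = {0: {0: 0.0, 1: 3.0}, 1: {0: 3.0, 1: 0.0}}
assign = {"a": "p0", "b": "p0", "c": "p1"}
pwgt = {"a": 2, "b": 1, "c": 4}
rwgt = {"a": 1, "b": 1, "c": 1}
cwgt = {("a", "b"): 1, ("a", "c"): 2, ("c", "a"): 1}

qwgt = queue_weights(assign, cluster_of, pwgt, cwgt, rwgt, proc, connect)
imbalance = load_imbalance(qwgt)
```

Passing `old_assign` adds the redistribution term for every vertex whose processor changed, which is how a diffusive repartitioning would be charged in this sketch.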
9 MinVar, Gain and ThroTTle
- Processor workload variance from WSysLL
  - $\mathrm{Var} = \sum_p (\mathrm{QWgt}(p) - \mathrm{WSysLL})^2$
- ΔVar reflects the change in MinVar after a vertex reassignment. A positive value implies that the Var value has increased
- Gain is the change (ΔQWgtToT) to total system load resulting from a vertex reassignment
- ThroTTle is a user-defined parameter. If Gain > 0, vertex moves that improve ΔVar are allowed if $\mathrm{Gain}^2 / (-\Delta\mathrm{Var}) < \mathrm{ThroTTle}$
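A small sketch of how Var, ΔVar, Gain and the ThroTTle test fit together is given here. It assumes WSysLL is the mean of the per-processor QWgt values and treats a ΔVar of exactly zero with a positive Gain as a rejection (to avoid dividing by zero); both choices, and the function names, are assumptions of this example.

```python
def variance(qwgt):
    """Var = sum over processors of (QWgt(p) - WSysLL)^2."""
    wsysll = sum(qwgt.values()) / len(qwgt)     # assumed: mean load over processors
    return sum((q - wsysll) ** 2 for q in qwgt.values())


def move_allowed(old_qwgt, new_qwgt, throttle):
    """Decide whether a projected vertex move is acceptable."""
    d_var = variance(new_qwgt) - variance(old_qwgt)           # delta Var
    gain = sum(new_qwgt.values()) - sum(old_qwgt.values())    # Gain = delta QWgtToT
    if d_var > 0:
        return False            # variance grew, so the move hurts balance
    if gain > 0:
        if d_var == 0:
            return False        # sketch choice: no improvement, extra load
        return gain ** 2 / -d_var < throttle
    return True


# Hypothetical projection: moving work from p0 to p1 adds 1 unit of total load.
before = {"p0": 10.0, "p1": 2.0}
after = {"p0": 7.0, "p1": 6.0}
```

With these toy numbers Var drops from 32 to 0.5 while the total load rises by 1, so the move passes a ThroTTle of 0.1 and fails a ThroTTle of 0.01.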
10 The N-Body Problem
- Classical problem of simulating movement of a set of bodies
- Based upon gravitational or electrostatic forces
- Iterates over a series of time steps
- At each step, for each body
  - Compute forces from all other bodies using the gravitational laws
  - Calculate acceleration and integrate twice to compute the position at the next time step
- If all the force calculations are performed, $O(n^2)$ computations are required at each time step.
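The direct method described above can be sketched in a few lines. This is an illustrative version only: it assumes gravitational forces, a gravitational constant of 1 in arbitrary units, and a simple Euler-style update (velocity first, then position). The slides do not state which integrator the authors used.

```python
import math

G = 1.0  # gravitational constant in arbitrary units (assumption)


def step(pos, vel, mass, dt):
    """Advance all bodies by one time step using direct O(n^2) summation."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r = math.sqrt(sum(x * x for x in d))
            for k in range(3):
                # Acceleration of body i due to body j: G * m_j * d / r^3
                acc[i][k] += G * mass[j] * d[k] / r ** 3
    for i in range(n):
        for k in range(3):
            vel[i][k] += acc[i][k] * dt   # first integration: acceleration -> velocity
            pos[i][k] += vel[i][k] * dt   # second integration: velocity -> position
    return pos, vel
```

The doubly nested loop is what produces the quadratic cost per time step. Coincident bodies would divide by zero here; a production solver would add a softening term.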
11 Barnes-Hut Solution (Framework for experiments)
- Reduces computational complexity from $O(n^2)$ to $O(n \lg n)$
- Clusters of bodies that are far from a cell are treated as a single body using the total mass and the center of mass position
- Cell Cv is considered far from cell Cw if the size of the cell divided by the distance between cells is less than a constant F
- Our implementation (for each time step)
  - Create the octree of cells
  - Form a graph using the cells of the octree
  - Partition the graph, distribute cells to be relocated among processors
  - Run the solver
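The two ingredients of the approximation, the far test and the single-body replacement, can be sketched as follows. The function names, the choice of cell centers as the reference points for the distance, and the threshold values are assumptions of this example; the slide does not specify them.

```python
import math


def center_of_mass(bodies):
    """bodies: list of (mass, (x, y, z)). Returns (total_mass, com_position)."""
    total = sum(m for m, _ in bodies)
    com = tuple(sum(m * p[k] for m, p in bodies) / total for k in range(3))
    return total, com


def is_far(cell_size, cell_center, other_center, threshold):
    """True if size / distance falls below the threshold constant."""
    dist = math.dist(cell_center, other_center)
    return cell_size / dist < threshold


# A far cluster is replaced by one pseudo-body at its center of mass.
pseudo_mass, pseudo_pos = center_of_mass([(1.0, (0.0, 0.0, 0.0)),
                                          (3.0, (4.0, 0.0, 0.0))])
```

Because each far cluster contributes one interaction instead of one per body, the per-body work drops from linear in n to roughly logarithmic for well-distributed bodies.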
12 The Partitioning Graph
- Each vertex, v, in the partitioning graph corresponds to a leaf cell, $C_v$ with $|C_v|$ bodies, in the N-Body octree and has two associated weights. $\mathrm{PWgt}_v$ models computations associated with the bodies, $\mathrm{RWgt}_v$ represents data distribution cost
  - $\mathrm{PWgt}_v = |C_v| \times (|C_v| - 1 + |\mathrm{CloseB}_v| + |\mathrm{Far}_v| + 2)$
  - $\mathrm{RWgt}_v = |C_v|$
- Each edge (v,w) weight CWgt(v,w) models the communication cost between cells $C_v$ and $C_w$.
  - $\mathrm{CWgt}(v,w) = |C_w|$ if $C_w$ is close to $C_v$; 0 otherwise.
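A short sketch of the vertex and edge weights follows, taking PWgt_v = |C_v| x (|C_v| - 1 + |CloseB_v| + |Far_v| + 2), RWgt_v = |C_v|, and CWgt(v,w) = |C_w| when C_w is close to C_v. The reading of CloseB_v as the number of bodies in cells close to C_v and Far_v as the number of far cells is an assumption of this example, as are the function and parameter names.

```python
def pwgt(n_bodies, n_close_bodies, n_far):
    """Processing weight of a leaf cell holding n_bodies bodies."""
    return n_bodies * (n_bodies - 1 + n_close_bodies + n_far + 2)


def rwgt(n_bodies):
    """Redistribution weight: one unit per body that would have to move."""
    return n_bodies


def cwgt(v, w, bodies_in, close_to):
    """CWgt(v, w) = |C_w| if C_w is close to C_v, else 0.

    bodies_in[x] is |C_x|; close_to[x] is the set of cells close to C_x.
    """
    return bodies_in[w] if w in close_to[v] else 0


# Hypothetical cells: w is close to v, but v is not close to w.
bodies_in = {"v": 4, "w": 2}
close_to = {"v": {"w"}, "w": set()}
```

The toy data shows why the resulting graph is directed: `cwgt("v", "w", ...)` is 2 while `cwgt("w", "v", ...)` is 0.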
13 Graph Modifications
- METIS limitations
  - Cannot operate on directed graphs
  - Cannot tolerate edge weights of zero
- N-Body graph
  - CWgt(v,w) can differ from CWgt(w,v) because $|C_v|$ may not equal $|C_w|$
  - CWgt(v,w) can equal 0 if $C_v$ is close to $C_w$ but $C_w$ is far from $C_v$.
- For direct comparisons, experiments are run using
  - Original N-Body graph (Graph G)
  - Modified graph (Graph Gm)
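The slides do not say how Gm is derived from G. One plausible way to satisfy both METIS limitations, shown purely as an assumption, is to merge each pair of directed edges into one undirected edge (summing the two weights) and to raise any zero weight to a small positive minimum.

```python
def make_undirected(cwgt, min_weight=1):
    """Build a symmetric, strictly positive edge-weight map from directed CWgt.

    Hypothetical construction: sum CWgt(v, w) and CWgt(w, v), then clamp the
    result to at least min_weight so no edge has weight zero.
    """
    merged = {}
    for (v, w), weight in cwgt.items():
        key = (min(v, w), max(v, w))          # one canonical key per vertex pair
        merged[key] = merged.get(key, 0) + weight
    return {edge: max(weight, min_weight) for edge, weight in merged.items()}


directed = {(1, 2): 4, (2, 1): 2, (2, 3): 0, (3, 2): 0, (1, 3): 5}
undirected = make_undirected(directed)
```

Any such transformation discards directional information, which is consistent with the later observation that results on Gm model the application less accurately than results on G.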
14 MinEX Basic Partition Criteria
- Minimize MaxQWgt rather than balance processor workloads.
- Collapse edges that result in the best Gain value using a min-heap
- Call user-defined latency tolerance function to estimate latency tolerance
- Move vertices from overloaded processors ($\mathrm{QWgt}_p > \mathrm{WSysLL}$) to underloaded processors ($\mathrm{QWgt}_p < \mathrm{WSysLL}$)
- Reject potential reassignments that
  - (i) have a positive ΔVar
  - (ii) are rejected by the reassignment filter function
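Two of these criteria lend themselves to a brief sketch: picking edges to collapse from a min-heap keyed on Gain, and splitting processors into overloaded and underloaded sets. The sketch assumes that a lower Gain is better, that each vertex is merged at most once per contraction pass, and that WSysLL is the mean load; none of this is confirmed by the slides, and the function names are hypothetical.

```python
import heapq


def collapse_order(edge_gain):
    """Return edges in the order they would be collapsed, lowest Gain first."""
    heap = [(gain, edge) for edge, gain in edge_gain.items()]
    heapq.heapify(heap)
    merged, order = set(), []
    while heap:
        gain, (v, w) = heapq.heappop(heap)
        if v in merged or w in merged:
            continue                      # an endpoint was already collapsed
        merged.update((v, w))
        order.append((v, w))
    return order


def split_by_load(qwgt):
    """Partition processors into (overloaded, underloaded) around WSysLL."""
    wsysll = sum(qwgt.values()) / len(qwgt)
    over = [p for p, q in qwgt.items() if q > wsysll]
    under = [p for p, q in qwgt.items() if q < wsysll]
    return over, under


order = collapse_order({("a", "b"): 3.0, ("b", "c"): 1.0,
                        ("c", "d"): 2.0, ("d", "e"): 0.5})
over, under = split_by_load({"p0": 12.0, "p1": 4.0, "p2": 8.0})
```

Vertices would then be considered for movement only from processors in `over` to processors in `under`.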
15 Reassignment Filter Function
- Goal: Avoid unnecessary edge processing and reject deleterious reassignments that cause increased edge processing
- IF (newQWgt_from > QWgt_from)
  - Reject Assignment
- IF (newQWgt_to < QWgt_to)
  - Reject Assignment
- IF (ΔVar > 0)
  - Reject Assignment
- IF (newGain > 0 AND newGain^2 / (-ΔVar) > ThroTTle)
  - Reject Assignment
- Δnew = newQWgt_from - newQWgt_to
- Δold = QWgt_from - QWgt_to
- IF (fabs(Δnew) > fabs(Δold))
  - IF (newQWgt_from < QWgt_to)
    - Reject Assignment
  - IF (newQWgt_to > QWgt_from)
    - Reject Assignment
- Assignment Passes Filter
- Projects newQWgt, ΔVar, newGain
- Vertex totals used
  - Edge weights, same cluster
  - Edge weights, other clusters
  - Local edge weights
  - Total outgoing edge weight
  - Relocation, processing weights
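The filter translates almost line for line into a function. This sketch assumes the last comparison is between |Δnew| and |Δold|, that the two conditions on the Gain line are joined by a logical AND, and that the projected values are supplied by the caller. The guard against a ΔVar of zero is an addition of this example to avoid a division by zero.

```python
def passes_filter(q_from, q_to, new_q_from, new_q_to, d_var, new_gain, throttle):
    """Return True if a projected vertex reassignment passes the filter."""
    if new_q_from > q_from:
        return False                      # source processor would get heavier
    if new_q_to < q_to:
        return False                      # destination would get lighter
    if d_var > 0:
        return False                      # variance would increase
    if new_gain > 0 and (d_var == 0 or new_gain ** 2 / -d_var > throttle):
        return False                      # too much extra load for the balance gained
    d_new = new_q_from - new_q_to
    d_old = q_from - q_to
    if abs(d_new) > abs(d_old):
        # The gap between the two processors widened: reject overshoots.
        if new_q_from < q_to:
            return False
        if new_q_to > q_from:
            return False
    return True
```

For example, `passes_filter(10, 2, 7, 6, -31.5, 1.0, 0.1)` passes, while the same move with a ThroTTle of 0.01 is rejected by the Gain test.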
16 Additional refinements (to enhance performance)
- Graph contraction phase
  - Bucket sort vertices by process
  - Quickly find candidates for merging
- Maintain a list of processors sorted by QWgt
  - Few processors change position after vertex moves
  - Maintaining this list incurs minimal overhead
- Defined a user-supplied latency tolerance function (called before each potential reassignment)
  - Double MinEX(User user, Ipg ipg, Qtot tot)
  - user: User options passed to the partitioner
  - ipg: Grid configuration graph
  - tot: contains Pproc_p, Comm_p, Remap_p, QLen_p
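The sorted processor list can be kept up to date cheaply because a vertex move changes the QWgt of only two processors. A minimal sketch using binary search is shown here; the class and method names are hypothetical and this is not the MinEX data structure itself.

```python
import bisect


class ProcessorOrder:
    """Processors kept sorted by QWgt, with cheap single-entry updates."""

    def __init__(self, qwgt):
        self.qwgt = dict(qwgt)
        self.order = sorted((q, p) for p, q in self.qwgt.items())

    def update(self, p, new_q):
        # Locate the old (QWgt, processor) entry by binary search and remove it.
        i = bisect.bisect_left(self.order, (self.qwgt[p], p))
        self.order.pop(i)
        self.qwgt[p] = new_q
        bisect.insort(self.order, (new_q, p))   # reinsert at the new position

    def lightest(self):
        return self.order[0][1]

    def heaviest(self):
        return self.order[-1][1]


procs = ProcessorOrder({"p0": 12.0, "p1": 4.0, "p2": 8.0})
procs.update("p0", 9.0)   # a vertex left p0 ...
procs.update("p1", 7.0)   # ... and arrived at p1
```

After a move only the source and destination entries are touched, so the rest of the list keeps its order without a full re-sort.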
17 Experimental Study: Simulation of a Grid Environment
- Simulated grid environment vs. actual grids
  - Low-cost alternative to constructing a wide range of heterogeneous configurations
  - Limited grid facilities are available in the field and are usually homogeneous
- Methodology
  - Discrete time simulation
  - Utilize configuration graph to model processing speed, communication latency, and bandwidth
  - Configurations (Processors: 32, 64, 128; Interconnect slowdowns: 10, 100; Clusters: 4, 8)
  - HO: Constant processing and intra-communication capability
  - UP: Faster processors have faster intra-communication capability
  - DN: Faster processors have slower intra-communication capability
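The three configuration types can be illustrated with a small generator. The concrete cost values below (cluster k costs k+1 units, lower meaning faster) are invented for illustration, and the slides do not state whether the full cross product of parameters was run; both are assumptions of this sketch.

```python
from itertools import product


def cluster_costs(kind, clusters):
    """Return (processing_cost, intra_comm_cost) lists, one entry per cluster.

    Lower cost means faster. HO is uniform, UP pairs fast processors with fast
    intra-cluster links, DN pairs fast processors with slow intra-cluster links.
    """
    if kind == "HO":
        proc = [1.0] * clusters
    else:
        proc = [float(k + 1) for k in range(clusters)]
    intra = list(reversed(proc)) if kind == "DN" else list(proc)
    return proc, intra


# Every combination of the parameters listed on the slide.
configs = list(product([32, 64, 128], [10, 100], [4, 8], ["HO", "UP", "DN"]))
```

Each tuple in `configs` is (processors, interconnect slowdown, clusters, type), which is enough to drive a simulated configuration graph.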
18 Reassignment Filter Effectiveness
        16K n-bodies            64K n-bodies            256K n-bodies
P       Total  Accept  Fail     Total  Accept  Fail     Total  Accept  Fail
8        6011     110     0     14991     212     0     25183     222     0
128     19192    2562     0     49082    5240     4     51876    4608     1
1K      18555    2790     7     23986    6569     4     35606   12639     2
- Reassignment filter eliminates virtually all overhead associated with vertex moves that are rejected
- Almost all assignments passing the filter were accepted
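The second bullet can be checked directly from the table. The sketch below assumes that "Accept" plus "Fail" is the number of reassignments that passed the filter and that "Fail" counts those rejected afterwards; the slide does not define the columns, so this reading is an assumption.

```python
# (bodies, processors, total, accept, fail), copied from the table above.
rows = [
    ("16K", "8", 6011, 110, 0), ("64K", "8", 14991, 212, 0), ("256K", "8", 25183, 222, 0),
    ("16K", "128", 19192, 2562, 0), ("64K", "128", 49082, 5240, 4), ("256K", "128", 51876, 4608, 1),
    ("16K", "1K", 18555, 2790, 7), ("64K", "1K", 23986, 6569, 4), ("256K", "1K", 35606, 12639, 2),
]


def accepted_share(accept, fail):
    """Fraction of filter-passing reassignments that were finally accepted."""
    return accept / (accept + fail)


shares = {(b, p): accepted_share(a, f) for b, p, _, a, f in rows}
worst = min(shares, key=shares.get)
```

Under this reading, every configuration in the table has an accepted share above 99 percent.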
19 Scalability Test (scales well to 128 processors)
- P varied between 8 and 1024; runtimes compared
20 ThroTTle Test (initially improves as ThroTTle increases until curve flattens out)
21 Multiple Time Step Test
- P=64, I=10, C=8, B=16K
            Single Iteration      50 Iterations
Type        RunTime  LoadImb      RunTime  LoadImb
MinEX-G         401     1.03          388     1.01
MinEX-Gm        413     1.05          398     1.02
METIS-Gm       1630     2.16         1534     2.03
- Running multiple iterations does not significantly impact the results
- The rest of the experiments will be based on a single time step
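The size of the difference between one and fifty iterations can be read off the table with a short calculation. The values are copied from the table; the helper name is hypothetical.

```python
# RunTime values from the table: (single iteration, 50 iterations).
runtime = {
    "MinEX-G": (401, 388),
    "MinEX-Gm": (413, 398),
    "METIS-Gm": (1630, 1534),
}


def relative_change(single, multi):
    """Relative change in RunTime when going from 1 to 50 iterations."""
    return (multi - single) / single


changes = {name: relative_change(*pair) for name, pair in runtime.items()}
metis_vs_minex = runtime["METIS-Gm"][0] / runtime["MinEX-G"][0]
```

All three partitioners change by less than 6 percent, while the gap between METIS-Gm and MinEX-G stays at roughly a factor of 4 in both columns.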
22 Partitioner Speed Comparisons
B     Type      P8    P16   P32   P64   P1h   P2h   P5h   P1k
16K   MinEX-G   .17   .20   .23   .33   .53   1.09  1.58  2.36
      MinEX-Gm  .18   .20   .23   .32   .53   1.13  1.51  2.39
      METIS-Gm  .16   .23   .35   1.02  1.05  1.46  1.81  2.88
64K   MinEX-G   .31   .33   .40   .59   1.00  1.93  3.09  4.93
      MinEX-Gm  .35   .37   .39   .58   1.05  1.99  3.09  4.73
      METIS-Gm  .21   .22   .45   .60   1.55  1.82  2.32  3.42
256K  MinEX-G   .48   .53   .57   .71   1.08  2.27  5.37  9.08
      MinEX-Gm  .50   .55   .55   .69   1.08  2.30  5.88  9.17
      METIS-Gm  .43   .49   .59   .76   1.20  2.57  3.18  4.18
- MinEX has the advantage for P32 and P64
- METIS has the advantage for P1k
- Overall, MinEX is competitive
23 Partition Quality Comparisons (C=8)
- MinEX and METIS show similar results for homogeneous configurations.
- Heterogeneous configurations show a clear advantage for MinEX
24 Partition Quality Comparisons (C=8)
- Similar results to I=10 experiments
- MinEX-Gm results are in general somewhat worse than MinEX-G because of less accurate application modeling
- METIS results are significantly worse than MinEX, but by a smaller margin than with faster interconnects. Slower interconnect speed makes the grid more homogeneous
25 Partition Quality Comparisons: Additional Observations
- DN configuration results are similar to UP experiments with a few exceptions
  - DN runs are worse than the UP runs in a few cases (998 vs 1489 if P=128, C=4, I=100, B=64K)
  - MinEX projected 975, but converged to 1489.
  - When simulating a second input channel, the solver converges at 975 for DN. No such improvement for METIS
- HO runs with P=32 and 64, I=100, B=256K give METIS an advantage (7399 to 5199 and 4231 to 3334 respectively).
  - MinEX is converging tightly (LoadImb=1.0001) to a high value
  - Perhaps the criteria for reassignments need to be further refined.
26 Conclusions
- Direct comparisons between MinEX and METIS
  - MinEX produces partitions that reduce runtime by up to a factor of 6 in highly heterogeneous grids
  - MinEX and METIS are competitive in homogeneous grids
  - MinEX is competitive with METIS in speed of execution
- Implemented performance refinements to MinEX
  - The reassignment filter minimizes overhead associated with potential reassignments that are rejected
  - Sorting processors by QWgt speeds up partitioning decisions
  - A bucket sort speeds up finding edges to collapse
- MinEX can partition directed graphs
  - Not commonly allowed by current partitioners
- Account for latency tolerance during partitioning
  - Established the benefit and feasibility of this approach
- N-body solver implementation
  - using the partitioning and message passing model.
27 On-going Research
- MinEX refinements
  - Analyze effect of using multiple I/O channels and network dynamics
  - Refine the method of selecting vertices for reassignment
- Refine the discrete time simulator
  - Develop a general-purpose tool for simulating heterogeneous grids
  - Establish the accuracy of the simulator by comparing its projections to the performance of applications running on real parallel systems