Designing an Efficient Partitioning Algorithm for Grid Environments with Application to N-Body Problems

1
Designing an Efficient Partitioning Algorithm for
Grid Environments with Application to N-Body
Problems

Daniel J. Harvey
Department of Computer Science
Southern Oregon University
E-mail harveyd_at_sou.edu
Sajal K. Das
Department of Computer Science and Engineering
The University of Texas at Arlington
E-mail das_at_cse.uta.edu
Rupak Biswas
NASA Ames Research Center
E-mail rbiswas_at_nas.nasa.gov

2
Presentation Overview

The information power grid (IPG)
The MinEX partitioner
This paper's contributions
Metrics utilized
The N-Body problem
MinEX refinements
Experimental study
Performance results
Conclusions and on-going research

3
The Information Power Grid (IPG)

Harness power of geographically separated
resources
Developed by NASA and other collaborative
partners
Utilize geographically separated processors to
solve large-scale computational problems
Characteristics
limited bandwidth and high latency
heterogeneous configurations
Relevant applications identified by I-Way
experiment
Remote access to large databases requiring
high-end graphics
Remote virtual reality access to instruments
Remote interactions with super-computer
simulations

4
Load Balancing Approaches (especially important
in grid environments)

Traditional Load Balancing Objectives
Distribute workload evenly among processors
Minimize idle time
Static load-balancing
Balance load prior to execution
Examples: smart compilers, schedulers
Dynamic load-balancing
Balance as application is processed
Examples: adaptive contracting, gradient,
symmetric broadcast networks
Semi-dynamic load-balancing (our focus in this
paper)
Temporarily stop application processing to
balance workload
Utilizes a partitioning technique
Examples: MeTiS, Jostle, PLUM

5
The MinEX Partitioner

We previously introduced a novel partitioner
called MinEX
"MinEX: A latency-tolerant dynamic partitioner
for grid computing applications," FGCS, 18 (2002),
pp. 477-489
MinEX's unique characteristics include
Environment: designed specifically for
heterogeneous, geographically distributed
environments
Grid: maps a configuration graph onto the partition
graph to produce partitions reflecting the grid
Goal: minimize runtime rather than balance
processing workload and minimize edge cut
Latency: accounts for latency tolerance during
partitioning
Accounts for data-movement communication
overhead

6
This Paper's Contributions

To compare MinEX performance to METIS, a
state-of-the-art partitioner
Result: Speed of execution is competitive
Result: Quality of partitions reduces application
runtime by up to a factor of 6
Estimate performance utilizing a wide range of
heterogeneous grid configurations
Apply MinEX to a real-life application (the
N-Body problem) executing in simulated grid
environments
Introduce refinements to our initial algorithm

7
The MinEX Partitioner

Multi-level scheme
Collapse edges incrementally
Partitions the contracted graph
Refines the graph in reverse
Reassignments executed to improve partition
quality
Creates diffusive or from-scratch partitions
User-supplied function estimates solver latency
tolerance
Accounts for data redistribution cost during
partitioning

8
Metrics Utilized

Processing weight
$Wgt_p^v = PWgt_v \times Proc_c$
Communication cost
$Comm_p^v = \sum_{w \notin p} CWgt(v,w) \times Connect(c,d)$
Redistribution cost
$Remap_p^v = RWgt_v \times Connect(c,d)$ if $p \neq q$
Weighted queue length
$QWgt(p) = \sum_{v \in p} (Wgt_p^v + Comm_p^v + Remap_p^v)$

Heaviest load (MaxQWgt)
$QLen_p = |\{v : v \in p\}|$
Average load (WSysLL)
Total system load $QWgtToT = \sum_{p \in P} QWgt(p)$
Imbalance factor
$LoadImb = MaxQWgt / WSysLL$

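
The metrics on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Vertex` record, the argument layout (`proc` indexed by cluster, `connect` as a cluster-by-cluster matrix, `cluster` indexed by processor) and the function names are hypothetical, the communication sum is taken over neighbours assigned to other processors, and WSysLL is assumed to be the total load divided by the number of processors.

```python
from dataclasses import dataclass


@dataclass
class Vertex:
    pwgt: float   # PWgt_v: processing weight
    rwgt: float   # RWgt_v: redistribution weight
    home: int     # processor q that currently holds the vertex data
    edges: dict   # neighbour vertex id -> CWgt(v, w)


def qwgt(p, assign, verts, proc, connect, cluster):
    """QWgt(p): sum of Wgt + Comm + Remap over the vertices assigned to p."""
    c = cluster[p]
    total = 0.0
    for v, owner in assign.items():
        if owner != p:
            continue
        vert = verts[v]
        wgt = vert.pwgt * proc[c]
        # Only edges to vertices held by other processors cost anything.
        comm = sum(cw * connect[c][cluster[assign[w]]]
                   for w, cw in vert.edges.items() if assign[w] != p)
        # Data must be copied only if it currently lives elsewhere (p != q).
        remap = vert.rwgt * connect[c][cluster[vert.home]] if vert.home != p else 0.0
        total += wgt + comm + remap
    return total


def load_metrics(assign, verts, proc, connect, cluster):
    """Return (MaxQWgt, WSysLL, LoadImb) for an assignment of vertices."""
    loads = [qwgt(p, assign, verts, proc, connect, cluster)
             for p in range(len(cluster))]
    max_qwgt = max(loads)
    wsysll = sum(loads) / len(loads)  # assumption: average = QWgtToT / P
    return max_qwgt, wsysll, max_qwgt / wsysll
```

A LoadImb of 1.0 would mean that every processor carries exactly the average weighted load.
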
9
MinVar, Gain and ThroTTle

Processor workload variance from WSysLL
$Var = \sum_p (QWgt(p) - WSysLL)^2$
ΔVar reflects the change in Var after a
vertex reassignment. A positive value implies
that the Var value has increased
Gain is the change (ΔQWgtToT) to total system load
resulting from a vertex reassignment
ThroTTle is a user-defined parameter. If Gain > 0,
vertex moves that improve Var (ΔVar < 0) are allowed if
$Gain^2 / (-\Delta Var) < ThroTTle$
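
As a rough illustration of how these three quantities interact, the sketch below recomputes Var before and after a proposed move and applies the ThroTTle test. The function names and the list-of-loads representation are hypothetical, and treating a move with ΔVar = 0 as "not an improvement" is an assumption, since the slide does not cover that case.

```python
def variance(loads):
    """Var = sum over processors of (QWgt(p) - WSysLL)^2."""
    wsysll = sum(loads) / len(loads)
    return sum((q - wsysll) ** 2 for q in loads)


def move_allowed(old_loads, new_loads, throttle):
    """Decide whether a reassignment that changes the loads is acceptable."""
    d_var = variance(new_loads) - variance(old_loads)  # delta Var
    gain = sum(new_loads) - sum(old_loads)             # delta QWgtToT
    if d_var >= 0:
        # Assumption: a move must strictly reduce Var to count as improving.
        return False
    if gain > 0:
        # Total load grows: accept only if the balance improvement is worth it.
        return gain ** 2 / -d_var < throttle
    return True
```

A larger ThroTTle value lets the partitioner accept moves that add more total load in exchange for a better balance.
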

10
The N-Body Problem

Classical problem of simulating the movement of a set
of bodies
Based upon gravitational or electrostatic forces
Iterates over a series of time steps
At each step, for each body
Compute forces from all other bodies using the
gravitational laws
Calculate acceleration and integrate twice to
compute the position at the next time step
If all the force calculations are performed, O(n^2)
computations are required at each time step.
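
A minimal sketch of one direct time step shows where the O(n^2) cost comes from. The integration scheme (velocity first, then position), the softening term `eps` and all names are assumptions chosen for illustration; the slides do not say which integrator the solver uses.

```python
def direct_step(pos, vel, mass, dt, g=1.0, eps=1e-9):
    """Advance all bodies by one time step using all-pairs gravity.

    pos, vel: lists of [x, y, z]; mass: list of floats.
    Returns new (pos, vel) lists and leaves the inputs untouched.
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):            # n bodies ...
        for j in range(n):        # ... times n partners: O(n^2) interactions
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(x * x for x in d) + eps   # eps avoids division by zero
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += g * mass[j] * d[k] * inv_r3
    # Integrate twice: acceleration -> velocity -> position.
    new_vel = [[vel[i][k] + acc[i][k] * dt for k in range(3)] for i in range(n)]
    new_pos = [[pos[i][k] + new_vel[i][k] * dt for k in range(3)] for i in range(n)]
    return new_pos, new_vel
```
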

11
Barnes-Hut Solution (Framework for experiments)

Reduces computational complexity from O(n^2) to
O(n lg n)
Clusters of bodies that are far from a cell are
treated as a single body using the total mass
and the center-of-mass position
Cell Cv is considered far from cell Cw if the
size of the cell divided by the distance between
the cells is less than a constant Φ
Our implementation (for each time step)
Create the octree of cells
Form a graph using the cells of the octree
Partition the graph, distribute cells to be
relocated among processors
Run the solver
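
The two ingredients of the approximation, summarizing a cluster by its mass and center of mass and testing whether a cell is far away, might look like the sketch below. The names, the default value of the constant (`phi=0.5`) and the choice of using the first cell's own size in the ratio are assumptions; the slide does not fix them.

```python
import math


def center_of_mass(bodies):
    """bodies: list of (mass, (x, y, z)). Returns (total mass, center of mass)."""
    total = sum(m for m, _ in bodies)
    com = tuple(sum(m * p[k] for m, p in bodies) / total for k in range(3))
    return total, com


def is_far(cell_size, cell_center, other_center, phi=0.5):
    """True if size / distance is below the threshold phi."""
    dist = math.dist(cell_center, other_center)
    return dist > 0 and cell_size / dist < phi
```

A far cluster then contributes a single force term computed from its total mass and center of mass instead of one term per body.
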

12
The Partitioning Graph

Each vertex v in the partitioning graph
corresponds to a leaf cell Cv, with |Cv| bodies,
in the N-body octree and has two associated
weights: PWgt_v models the computations associated
with the bodies, and RWgt_v represents the data
redistribution cost
$PWgt_v = |C_v| \times (|C_v| - 1 + CloseB_v + Far_v + 2)$
$RWgt_v = |C_v|$
Each edge (v,w) has a weight CWgt(v,w) that models the
communication cost between cells Cv and Cw.
$CWgt(v,w) = |C_w|$ if $C_w$ is close to $C_v$, 0
otherwise.
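
The weight definitions above can be turned into a small helper. This sketch rests on interpretations the slide does not spell out: CloseB_v is taken to be the number of bodies in cells close to Cv, and Far_v the number of cells far from Cv. The function name and the dictionary-based inputs are hypothetical.

```python
def build_weights(n_bodies, close, far_count):
    """Compute PWgt, RWgt and the directed edge weights CWgt.

    n_bodies[v]  : |Cv|, number of bodies in leaf cell v
    close[v]     : set of cells w that are close to Cv
    far_count[v] : Far_v (assumed: number of cells far from Cv)
    """
    pwgt, rwgt, cwgt = {}, {}, {}
    for v, nv in n_bodies.items():
        # Assumed meaning of CloseB_v: bodies in the cells close to Cv.
        close_b = sum(n_bodies[w] for w in close[v])
        pwgt[v] = nv * (nv - 1 + close_b + far_count[v] + 2)
        rwgt[v] = nv
        for w in close[v]:
            cwgt[(v, w)] = n_bodies[w]  # missing pairs mean weight 0
    return pwgt, rwgt, cwgt
```

Because closeness is evaluated per cell, `(v, w)` can be present while `(w, v)` is absent, so the resulting graph is directed.
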

13
Graph Modifications

METIS Limitations
Cannot operate on directed graphs
Cannot tolerate edge weights of zero
N-Body graph
CWgt(v,w) can differ from CWgt(w,v)
because |Cv| may not equal |Cw|
CWgt(v,w) can equal 0 if Cv is close to Cw but Cw
is far from Cv.
For direct comparisons, experiments are run using
Original N-Body graph (Graph G)
Modified Graph (Graph Gm)
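
The slides do not say how Gm is derived from G. Purely as an illustration, one plausible conversion that satisfies both METIS restrictions is to merge the two directions of each edge into a single undirected weight and raise any zero result to a minimum of 1. The function name, the summing rule and the `min_weight` parameter are all assumptions, not the authors' stated method.

```python
def symmetrize(cwgt, min_weight=1):
    """Turn directed weights {(v, w): c} into undirected, non-zero weights.

    Assumption: the undirected weight is the sum of both directions,
    clamped from below to min_weight so no edge weight is zero.
    """
    undirected = {}
    for (v, w), c in cwgt.items():
        key = (min(v, w), max(v, w))
        undirected[key] = undirected.get(key, 0) + c
    return {k: max(c, min_weight) for k, c in undirected.items()}
```

Any such conversion discards directional information, which would be consistent with a modified graph modeling the application less accurately than the original one.
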

14
MinEX Basic Partition Criteria

Minimize MaxQWgt rather than balance processor
workloads.
Collapse edges that result in the best Gain value
using a min-heap
Call user-defined latency tolerance function to
estimate latency tolerance
Move vertices from overloaded processors (QWgt(p)
> WSysLL) to underloaded processors (QWgt(p) <
WSysLL)
Reject potential reassignments that
(i) have a positive ΔVar
(ii) are rejected by the reassignment filter
function

15
Reassignment Filter Function
Goal: avoid unnecessary edge processing and reject
deleterious reassignments that cause increased
edge processing

IF (newQWgt_from > QWgt_from)
  Reject Assignment
IF (newQWgt_to < QWgt_to)
  Reject Assignment
IF (ΔVar > 0)
  Reject Assignment
IF (newGain > 0 AND newGain^2 / -ΔVar > ThroTTle)
  Reject Assignment
Δnew = newQWgt_from - newQWgt_to
Δold = QWgt_from - QWgt_to
IF (fabs(Δnew) > fabs(Δold))
  IF (newQWgt_from < QWgt_to)
    Reject Assignment
  IF (newQWgt_to > QWgt_from)
    Reject Assignment
Assignment Passes Filter

Projects newQWgt, ΔVar, newGain
Vertex totals used
Edge weights same cluster
Edge weights other clusters
Local Edge weights
Total outgoing edge weight
Relocation, Processing weights
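
The filter logic on this slide translates almost line for line into Python. The sketch below assumes the projected values (new queue weights, ΔVar, newGain) have already been computed from the vertex totals; the function and parameter names are hypothetical. Rejecting a move with newGain > 0 and ΔVar = 0 is an added guard against division by zero, not something the slide states.

```python
def passes_filter(q_from, q_to, new_q_from, new_q_to, d_var, new_gain, throttle):
    """Return True if a projected reassignment survives the filter."""
    if new_q_from > q_from:      # source processor would get heavier
        return False
    if new_q_to < q_to:          # destination processor would get lighter
        return False
    if d_var > 0:                # variance would increase
        return False
    if new_gain > 0 and (d_var == 0 or new_gain ** 2 / -d_var > throttle):
        return False             # too much added load for the balance gained
    d_new = new_q_from - new_q_to
    d_old = q_from - q_to
    if abs(d_new) > abs(d_old):  # gap between the two processors widens
        if new_q_from < q_to:
            return False
        if new_q_to > q_from:
            return False
    return True
```

Since every test uses only per-processor totals, a rejected move never requires walking the edges of the candidate vertex.
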

16
Additional refinements (to enhance performance)

Graph contraction phase
Bucket sort vertices by process
Quickly find candidates for merging
Maintain a list of processors sorted by QWgt
Few processors change position after vertex moves
Maintaining this list incurs minimal overhead
User-defined latency tolerance function
(called before each potential reassignment)
double MinEX(User user, Ipg ipg, Qtot tot)
user: user options passed to the partitioner
ipg: grid configuration graph
tot: contains Pproc_p, Comm_p, Remap_p, QLen_p
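
The observation that few processors change position after a vertex move suggests repairing the sorted list locally instead of re-sorting it. The following is a sketch of that idea, not the authors' code: `order` is a list of processor ids in ascending QWgt order, `pos` maps each id to its index, and all names are hypothetical.

```python
def reposition(order, pos, qwgt, p):
    """Restore ascending QWgt order after qwgt[p] has changed.

    Neighbours are shifted one slot at a time, so the cost is proportional
    to how far p actually moves, which is usually a few positions.
    """
    i = pos[p]
    # Shift heavier left-hand neighbours one slot right.
    while i > 0 and qwgt[order[i - 1]] > qwgt[p]:
        order[i] = order[i - 1]
        pos[order[i]] = i
        i -= 1
    # Shift lighter right-hand neighbours one slot left.
    while i < len(order) - 1 and qwgt[order[i + 1]] < qwgt[p]:
        order[i] = order[i + 1]
        pos[order[i]] = i
        i += 1
    order[i] = p
    pos[p] = i
```

After each accepted vertex move, calling `reposition` for the source and the destination processor keeps the most and least loaded processors available at the ends of the list.
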

17
Experimental Study: Simulation of a Grid
Environment

Simulated grid environment vs. actual grids
Low-cost alternative to constructing a wide range
of heterogeneous configurations
Limited grid facilities are available in the
field and are usually homogeneous
Methodology
Discrete time simulation
Utilize configuration graph to model processing
speed, communication latency, and bandwidth
Configurations (Processors: 32, 64, 128;
Interconnect slowdowns: 10, 100; Clusters: 4, 8)
HO: Constant processing and intra-communication
capability
UP: Faster processors have faster
intra-communication capability
DN: Faster processors have slower
intra-communication capability

18
Reassignment Filter Effectiveness

        16K bodies           64K bodies           256K bodies
P       Total  Accept  Fail  Total  Accept  Fail  Total  Accept  Fail
8        6011     110     0  14991     212     0  25183     222     0
128     19192    2562     0  49082    5240     4  51876    4608     1
1K      18555    2790     7  23986    6569     4  35606   12639     2

The reassignment filter eliminates virtually all
overhead associated with vertex moves that are rejected
Almost all assignments passing the filter were
accepted

19
Scalability Test (scales well to 128
processors)

P varied between 8 and 1024; runtimes compared

20
ThroTTle Test (initially improves as ThroTTle
increases, until the curve flattens out)

21
Multiple Time Step Test (P=64, I=10, C=8, B=16K)

            Single Iteration  50 Iterations
Type        RunTime  LoadImb  RunTime  LoadImb
MinEX-G         401     1.03      388     1.01
MinEX-Gm        413     1.05      398     1.02
METIS-Gm       1630     2.16     1534     2.03

Running multiple iterations does not
significantly impact the results
The rest of the experiments will be based on a
single time step

22
Partitioner Speed Comparisons

B     Type          P=8   P=16   P=32   P=64  P=128  P=256  P=512   P=1K
16K   MinEX-G       .17    .20    .23    .33    .53   1.09   1.58   2.36
      MinEX-Gm      .18    .20    .23    .32    .53   1.13   1.51   2.39
      METIS-Gm      .16    .23    .35   1.02   1.05   1.46   1.81   2.88
64K   MinEX-G       .31    .33    .40    .59   1.00   1.93   3.09   4.93
      MinEX-Gm      .35    .37    .39    .58   1.05   1.99   3.09   4.73
      METIS-Gm      .21    .22    .45    .60   1.55   1.82   2.32   3.42
256K  MinEX-G       .48    .53    .57    .71   1.08   2.27   5.37   9.08
      MinEX-Gm      .50    .55    .55    .69   1.08   2.30   5.88   9.17
      METIS-Gm      .43    .49    .59    .76   1.20   2.57   3.18   4.18

MinEX has the advantage for P=32 and P=64
METIS has the advantage for P=1K
Overall, MinEX is competitive

23
Partition Quality Comparisons (C=8)

MinEX and METIS show similar results for
Homogeneous configurations.
Heterogeneous configurations show clear advantage
to MinEX

24
Partition Quality Comparisons (C=8)

Similar results to I=10 experiments
MinEX-Gm results are in general somewhat worse
than MinEX-G because of less accurate application
modeling
METIS results are significantly worse than MinEX,
but by a smaller margin than with faster
interconnects. A slower interconnect makes the
grid more homogeneous

25
Partition Quality Comparisons: Additional
Observations

DN configuration results are similar to UP
experiments with a few exceptions
DN runs are worse than the UP runs in a few
cases (998 vs. 1489 for P=128, C=4, I=100, B=64K)
MinEX projected 975, but the solver converged to 1489.
When simulating a second input channel, the
solver converges at 975 for DN. No such
improvement for METIS
HO runs with P=32 and 64, I=100, B=256K give METIS
an advantage (7399 to 5199 and 4231 to 3334,
respectively).
MinEX is converging tightly (LoadImb = 1.0001) to
a high value
Perhaps the criteria for reassignments need to
be further refined.

26
Conclusions

Direct comparisons between MinEX and METIS
MinEX produces partitions that reduce runtime by
up to a factor of 6 in highly heterogeneous grids
MinEX and METIS are competitive in homogeneous
grids
MinEX is competitive with METIS in speed of
execution
Implemented performance refinements to MinEX
The reassignment filter minimizes overhead
associated with potential reassignments that are
rejected
Sorting processors by QWgt speeds up partitioning
decisions
A bucket sort speeds up finding edges to collapse
MinEX can partition directed graphs
Not commonly allowed by current partitioners
Accounts for latency tolerance during partitioning
Established the benefit and feasibility of this
approach
N-body solver implementation
using the partitioning and message-passing model.

27
On-going Research

MinEX Refinements
Analyze the effect of using multiple I/O channels and
network dynamics
Refine the method of selecting vertices for
reassignment
Refine the discrete time simulator
Develop a general-purpose tool for simulating
heterogeneous grids
Establish the accuracy of the simulator by
comparing its projections to the performance of
applications running on real parallel systems