Designing an Efficient Partitioning Algorithm for Grid Environments with Application to N-Body Problems - PowerPoint PPT Presentation

Slides: 28
Provided by: Danie620
Learn more at: http://cs.sou.edu

1
Designing an Efficient Partitioning Algorithm for
Grid Environments with Application to N-Body
Problems
  • Daniel J. Harvey
  • Department of Computer Science
  • Southern Oregon University
  • E-mail harveyd_at_sou.edu
  • Sajal K. Das
  • Department of Computer Science and Engineering
  • The University of Texas at Arlington
  • E-mail das_at_cse.uta.edu
  • Rupak Biswas
  • NASA Ames Research Center
  • E-mail rbiswas_at_nas.nasa.gov

2
Presentation Overview
  • The information power grid (IPG)
  • The MinEX partitioner
  • This paper's contributions
  • Metrics utilized
  • The N-Body problem
  • MinEX refinements
  • Experimental study
  • Performance results
  • Conclusions and on-going research

3
The Information Power Grid (IPG)
  • Harness power of geographically separated
    resources
  • Developed by NASA and other collaborative
    partners
  • Utilize geographically separated processors to
    solve large-scale computational problems
  • Characteristics
  • limited bandwidth and high latency
  • heterogeneous configurations
  • Relevant applications identified by I-Way
    experiment
  • Remote access to large databases requiring
    high-end graphics
  • Remote virtual reality access to instruments
  • Remote interactions with super-computer
    simulations

4
Load Balancing Approaches (especially important
in grid environments)
  • Traditional Load Balancing Objectives
  • Distribute workload evenly among processors
  • Minimize idle time
  • Static load-balancing
  • Balance load prior to execution
  • Examples: smart compilers, schedulers
  • Dynamic load-balancing
  • Balance as application is processed
  • Examples: adaptive contracting, gradient,
    symmetric broadcast networks
  • Semi-dynamic load-balancing (Our focus in this
    paper)
  • Temporarily stop application processing to
    balance workload
  • Utilizes a partitioning technique
  • Examples: METIS, Jostle, PLUM

5
The MinEX Partitioner
  • We previously introduced a novel partitioner
    called MinEX
  • MinEX: A latency-tolerant dynamic partitioner
    for grid computing applications, FGCS, 18 (2002),
    pp. 477-489
  • MinEX's unique characteristics include:
  • Environment: designed specifically for
    heterogeneous, geographically distributed
    environments
  • Grid: maps the configuration graph onto the partition
    graph; produces partitions reflecting the grid
  • Goal: minimize runtime rather than balance
    processing workload and minimize edge cut
  • Latency: accounts for latency tolerance during
    partitioning
  • Accounts for data movement communication
    overhead

6
This Paper's Contributions
  • Compare MinEX performance to METIS, a
    state-of-the-art partitioner
  • Result: Speed of execution is competitive
  • Result: Quality of partitions reduces application
    runtime by up to a factor of 6
  • Estimate performance utilizing a wide range of
    heterogeneous grid configurations
  • Apply MinEX to a real-life application (the
    N-Body problem) executing in simulated grid
    environments
  • Introduce refinements to our initial algorithm

7
The MinEX Partitioner
  • Multi-level scheme
  • Collapse edges incrementally
  • Partitions the contracted graph
  • Refines the graph in reverse
  • Reassignments executed to improve partition
    quality
  • Creates diffusive or from scratch partitions
  • User-supplied function estimates solver latency
    tolerance
  • Accounts for data redistribution cost during
    partitioning

8
Metrics Utilized
  • Processing weight
  • $Wgt_v^p = PWgt_v \times Proc_c$
  • Communication cost
  • $Comm_v^p = \sum_{w \notin p} CWgt(v,w) \times Connect(c,d)$
  • Redistribution cost
  • $Remap_v^p = RWgt_v \times Connect(c,d)$ if $p \neq q$
  • Weighted queue length
  • $QWgt(p) = \sum_{v \in p} (Wgt_v^p + Comm_v^p + Remap_v^p)$
  • Heaviest load (MaxQWgt)
  • $QLen_p = |\{\text{vertices} \in p\}|$
  • Average load (WSysLL)
  • Total system load $QWgtToT = \sum_{p \in P} QWgt(p)$
  • Imbalance factor
  • $LoadImb = MaxQWgt / WSysLL$
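
The queue-length and imbalance metrics above can be illustrated with a minimal Python sketch. It assumes the per-vertex costs (Wgt, Comm, Remap) have already been computed, and it takes WSysLL to be the plain mean of QWgt(p) over processors, which is an assumption of this sketch and not something the slide spells out. The function names and sample numbers are hypothetical.

```python
def queue_weights(assign, wgt, comm, remap, num_procs):
    """QWgt(p): sum of Wgt + Comm + Remap over the vertices assigned to p."""
    qwgt = [0.0] * num_procs
    for v, p in enumerate(assign):
        qwgt[p] += wgt[v] + comm[v] + remap[v]
    return qwgt


def load_imbalance(qwgt):
    """LoadImb = MaxQWgt / WSysLL (WSysLL assumed to be the mean QWgt)."""
    max_qwgt = max(qwgt)
    wsysll = sum(qwgt) / len(qwgt)  # QWgtToT divided by processor count
    return max_qwgt / wsysll


# Hypothetical data: four vertices on two processors.
assign = [0, 0, 1, 1]          # vertex -> processor
wgt = [4.0, 2.0, 3.0, 1.0]     # processing cost per vertex
comm = [1.0, 0.0, 0.5, 0.5]    # communication cost per vertex
remap = [0.0, 0.0, 1.0, 0.0]   # redistribution cost per vertex

qwgt = queue_weights(assign, wgt, comm, remap, num_procs=2)
imbalance = load_imbalance(qwgt)
```

A LoadImb of 1.0 would mean every processor carries exactly the average load; larger values mean the heaviest processor dominates the runtime.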

9
MinVar, Gain, and ThroTTle
  • Processor workload variance from WSysLL
  • $Var = \sum_p (QWgt(p) - WSysLL)^2$
  • $\Delta Var$ reflects the change in Var after a
    vertex reassignment. A positive value implies
    that the Var value has increased
  • Gain is the change ($\Delta QWgtToT$) to total system load
    resulting from a vertex reassignment
  • ThroTTle is a user-defined parameter. If Gain > 0,
    vertex moves that improve $\Delta Var$ are allowed only if
    $Gain^2 / (-\Delta Var) < ThroTTle$
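
As a rough illustration of the variance and ThroTTle test, the Python sketch below evaluates a proposed move from the queue weights before and after it. Two points are assumptions of this sketch: WSysLL is recomputed as the mean of each weight list, and a move that improves the variance without increasing total load (Gain <= 0) is simply allowed. The function names are hypothetical.

```python
def variance(qwgt):
    """Var: sum over processors of (QWgt(p) - WSysLL)^2, WSysLL = mean."""
    wsysll = sum(qwgt) / len(qwgt)
    return sum((q - wsysll) ** 2 for q in qwgt)


def move_allowed(qwgt_before, qwgt_after, throttle):
    """Apply the ThroTTle rule to one proposed vertex reassignment."""
    dvar = variance(qwgt_after) - variance(qwgt_before)  # negative = better
    gain = sum(qwgt_after) - sum(qwgt_before)            # change in total load
    if dvar >= 0:
        return False      # the move does not improve the variance
    if gain <= 0:
        return True       # assumption: no extra system load, so accept
    # Total load grows: accept only if the growth is small relative to the
    # variance improvement.
    return gain ** 2 / -dvar < throttle


# Moving work from a heavily loaded processor costs one unit of extra load.
before = [10.0, 2.0]
after = [6.0, 7.0]
ok_loose = move_allowed(before, after, throttle=0.1)
ok_tight = move_allowed(before, after, throttle=0.01)
```

A larger ThroTTle lets the partitioner trade more total load for better balance; a smaller one makes it conservative.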

10
The N-Body Problem
  • Classical problem of simulating movement of a set
    of bodies
  • Based upon gravitational or electrostatic forces
  • Iterates over a series of time steps
  • At each step for each body
  • Compute forces from all other bodies using the
    gravitational laws
  • Calculate acceleration and integrate twice to
    compute the position at the next time step
  • If all pairwise force calculations are performed, $O(n^2)$
    computations are required at each time step.
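
The direct approach described above can be sketched in a few lines of Python. This is a generic direct-summation step, not the authors' solver: the gravitational constant, the softening term `eps`, and the simple two-stage Euler update (acceleration to velocity, velocity to position) are all assumptions chosen for illustration.

```python
import math

G = 1.0  # gravitational constant in arbitrary units (assumption)


def accelerations(pos, mass, eps=1e-9):
    """Acceleration on every body from all other bodies: O(n^2) pairs."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(x * x for x in d) + eps  # eps avoids division by zero
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
    return acc


def step(pos, vel, mass, dt):
    """Advance one time step: integrate acceleration, then velocity."""
    acc = accelerations(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += acc[i][k] * dt   # first integration
            pos[i][k] += vel[i][k] * dt   # second integration


# Two unit masses one unit apart, initially at rest.
pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mass = [1.0, 1.0]
step(pos, vel, mass, dt=0.01)
```

The doubly nested loop over bodies is what makes the cost quadratic in the number of bodies.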

11
Barnes-Hut Solution (Framework for experiments)
  • Reduces computational complexity from $O(n^2)$ to
    $O(n \lg n)$
  • Clusters of bodies that are far from a cell are
    treated as a single body using their total mass
    and center-of-mass position
  • Cell $C_v$ is considered far from cell $C_w$ if the
    size of the cell divided by the distance between
    cells is less than a constant F
  • Our implementation (for each time step)
  • Create the octree of cells
  • Form a graph using the cells of the octree
  • Partition the graph, distribute cells to be
    relocated among processors
  • Run the solver
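
The two ingredients of the approximation, the "far" test and the replacement of a cluster by a single body, can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the slide does not say which cell's size enters the ratio, so the caller passes it explicitly, and `threshold` stands in for the constant in the criterion. The function names are hypothetical.

```python
import math


def is_far(cell_size, center_a, center_b, threshold):
    """True if size / distance is below the threshold constant."""
    dist = math.dist(center_a, center_b)
    return dist > 0.0 and cell_size / dist < threshold


def center_of_mass(bodies):
    """Collapse a cluster into (total mass, center-of-mass position).

    bodies is a list of (mass, (x, y, z)) tuples.
    """
    total = sum(m for m, _ in bodies)
    com = tuple(sum(m * p[k] for m, p in bodies) / total for k in range(3))
    return total, com


# A unit-size cell 10 units away is treated as far; 1 unit away is not.
far = is_far(1.0, (0.0, 0.0, 0.0), (10.0, 0.0, 0.0), threshold=0.5)
near = is_far(1.0, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), threshold=0.5)
total_mass, com = center_of_mass([(1.0, (0.0, 0.0, 0.0)),
                                  (3.0, (4.0, 0.0, 0.0))])
```

Far cells contribute one interaction each, which is where the reduction from quadratic cost comes from.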

12
The Partitioning Graph
  • Each vertex, v, in the partitioning graph
    corresponds to a leaf cell, $C_v$ with $|C_v|$ bodies,
    in the N-Body octree and has two associated
    weights: $PWgt_v$ models computations associated
    with the body, $RWgt_v$ represents data distribution
    cost
  • $PWgt_v = |C_v| \times (|C_v| - 1 + CloseB_v + Far_v + 2)$
  • $RWgt_v = |C_v|$
  • Each edge (v,w) weight $CWgt(v,w)$ models the
    communication cost between cells $C_v$ and $C_w$.
  • $CWgt(v,w) = |C_w|$ if $C_w$ is close to $C_v$; 0
    otherwise.

13
Graph Modifications
  • METIS Limitations
  • Cannot operate on directed graphs
  • Cannot tolerate edge weights of zero
  • N-Body graph
  • $CWgt(v,w)$ can be different from $CWgt(w,v)$
    because $|C_v|$ may not equal $|C_w|$
  • $CWgt(v,w)$ can equal 0 if $C_v$ is close to $C_w$ but $C_w$
    is far from $C_v$.
  • For direct comparisons, experiments are run using:
  • Original N-Body graph (Graph G)
  • Modified Graph (Graph Gm)

14
MinEX Basic Partition Criteria
  • Minimize MaxQWgt rather than balance processor
    workloads.
  • Collapse edges that result in the best Gain value
    using a min-heap
  • Call user-defined latency tolerance function to
    estimate latency tolerance
  • Move vertices from overloaded processors ($QWgt(p) > WSysLL$)
    to underloaded processors ($QWgt(p) < WSysLL$)
  • Reject potential reassignments that (i) have a
    positive $\Delta Var$
  • (ii) are rejected by the reassignment filter
    function

15
Reassignment Filter Function (Goal: avoid
unnecessary edge processing and reject
deleterious reassignments that cause increased
edge processing)
  • IF ($newQWgt_{from} > QWgt_{from}$)
  • Reject Assignment
  • IF ($newQWgt_{to} < QWgt_{to}$)
  • Reject Assignment
  • IF ($\Delta Var > 0$)
  • Reject Assignment
  • IF ($newGain > 0$ AND $newGain^2 / (-\Delta Var) > ThroTTle$)
  • Reject Assignment
  • $\Delta new = newQWgt_{from} - newQWgt_{to}$
  • $\Delta old = QWgt_{from} - QWgt_{to}$
  • IF ($|\Delta new| > |\Delta old|$)
  • IF ($newQWgt_{from} < QWgt_{to}$)
  • Reject Assignment
  • IF ($newQWgt_{to} > QWgt_{from}$)
  • Reject Assignment
  • Assignment Passes Filter
  • Projects $newQWgt$, $\Delta Var$, newGain
  • Vertex totals used
  • Edge weights same cluster
  • Edge weights other clusters
  • Local Edge weights
  • Total outgoing edge weight
  • Relocation, Processing weights

16
Additional refinements (to enhance performance)
  • Graph contraction phase
  • Bucket sort vertices by processor
  • Quickly find candidates for merging
  • Maintain a list of processors sorted by QWgt
  • Few processors change position after vertex moves
  • Maintaining this list incurs minimal overhead
  • Defined a user-supplied latency tolerance function
    (called before each potential reassignment)
  • Double MinEX(User user, Ipg ipg, Qtot tot)
  • user: User options passed to the partitioner
  • ipg: Grid configuration graph
  • tot: contains $Pproc_p$, $Comm_p$, $Remap_p$, $QLen_p$

17
Experimental Study: Simulation of a Grid
Environment
  • Simulated Grid Environment vs actual grids
  • Low-cost alternative to constructing a wide range
    of heterogeneous configurations
  • Limited grid facilities are available in the
    field and are usually homogeneous
  • Methodology
  • Discrete time simulation
  • Utilize configuration graph to model processing
    speed, communication latency, and bandwidth
  • Configurations (Processors: 32, 64, 128;
    Interconnect slowdowns: 10, 100; Clusters: 4, 8)
  • HO: Constant processing and intra-communication
    capability
  • UP: Faster processors have faster
    intra-communication capability
  • DN: Faster processors have slower
    intra-communication capability

18
Reassignment Filter Effectiveness
        16K n-bodies           64K n-bodies           256K n-bodies
P       Total  Accept  Fail    Total  Accept  Fail    Total  Accept  Fail
8        6011     110     0    14991     212     0    25183     222     0
128     19192    2562     0    49082    5240     4    51876    4608     1
1K      18555    2790     7    23986    6569     4    35606   12639     2
  • Reassignment filter eliminates virtually all
    overhead associated with vertex moves that are rejected
  • Almost all assignments passing the filter were
    accepted

19
Scalability Test (scales well to 128
processors): P varied between 8 and 1024,
runtimes compared

20
ThroTTle Test (initially improves as throttle
increases until the curve flattens out)

21
Multiple Time Step Test (P=64, I=10, C=8, B=16K)
            Single Iteration      50 Iterations
Type        RunTime  LoadImb      RunTime  LoadImb
MinEX-G         401     1.03          388     1.01
MinEX-Gm        413     1.05          398     1.02
METIS-Gm       1630     2.16         1534     2.03
  • Running multiple iterations does not
    significantly impact the results
  • The rest of the experiments will be based on a
    single time step

22
Partitioner Speed Comparisons
B     Type      P=8   P=16  P=32  P=64  P=1h  P=2h  P=5h  P=1k
16K   MinEX-G   .17   .20   .23   .33   .53   1.09  1.58  2.36
      MinEX-Gm  .18   .20   .23   .32   .53   1.13  1.51  2.39
      METIS-Gm  .16   .23   .35   1.02  1.05  1.46  1.81  2.88
64K   MinEX-G   .31   .33   .40   .59   1.00  1.93  3.09  4.93
      MinEX-Gm  .35   .37   .39   .58   1.05  1.99  3.09  4.73
      METIS-Gm  .21   .22   .45   .60   1.55  1.82  2.32  3.42
256K  MinEX-G   .48   .53   .57   .71   1.08  2.27  5.37  9.08
      MinEX-Gm  .50   .55   .55   .69   1.08  2.30  5.88  9.17
      METIS-Gm  .43   .49   .59   .76   1.20  2.57  3.18  4.18
  • MinEX has the advantage for P=32 and P=64
  • METIS has the advantage for P=1k
  • Overall, MinEX is competitive

23
Partition Quality Comparisons (C=8)
  • MinEX and METIS show similar results for
    Homogeneous configurations.
  • Heterogeneous configurations show clear advantage
    to MinEX

24
Partition Quality Comparisons (C=8)
  • Similar results to I=10 experiments
  • MinEX-Gm results are in general somewhat worse
    than MinEX-G because of less accurate application
    modeling
  • METIS results are significantly worse than MinEX,
    but less so than with faster interconnects. Slower
    interconnect speed makes the grid more homogeneous

25
Partition Quality Comparisons: Additional
Observations
  • DN configuration results are similar to UP
    experiments with a few exceptions
  • DN runs are worse than the UP runs in a few
    cases (998 vs 1489 for P=128, C=4, I=100, B=64K)
  • MinEX projected 975, but converged to 1489.
  • When simulating a second input channel, the
    solver converges at 975 for DN. No such
    improvement for METIS
  • HO runs with P=32 and 64, I=100, B=256K give METIS
    an advantage (7399 to 5199 and 4231 to 3334,
    respectively).
  • MinEX is converging tightly (LoadImb=1.0001) to
    a high value
  • Perhaps the criteria for reassignments need to
    be further refined.

26
Conclusions
  • Direct comparisons between MinEX and METIS
  • MinEX produces partitions that reduce runtime by
    up to a factor of 6 in highly-heterogeneous grids
  • MinEX and METIS are competitive in homogeneous
    grids
  • MinEX is competitive with METIS in speed of
    execution
  • Implemented performance refinements to MinEX
  • The reassignment filter minimizes overhead
    associated with potential reassignments that are
    rejected
  • Sorting processors by QWgt speeds up partitioning
    decisions
  • A bucket sort speeds up finding edges to collapse
  • MinEX can partition directed graphs
  • Not commonly allowed by current partitioners
  • Account for latency tolerance during partitioning
  • Established the benefit and feasibility of this
    approach
  • N-body solver implementation
  • using the partitioning and message passing model.

27
On-going Research
  • MinEX Refinements
  • Analyze the effect of using multiple I/O channels and
    network dynamics
  • Refine the method of selecting vertices for
    reassignment
  • Refine the discrete time simulator
  • Develop a general-purpose tool for simulating
    heterogeneous grids
  • Establish the accuracy of the simulator by
    comparing its projections to the performance of
    applications running on real parallel systems