Title: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications
1. Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications
- Sajal K. Das and Daniel J. Harvey
- Department of Computer Science and Engineering
- The University of Texas at Arlington
- E-mail: {das,harvey}@cse.uta.edu
- Rupak Biswas
- NASA Ames Research Center
- E-mail: rbiswas@nas.nasa.gov
2. Presentation Overview
- The Information Power Grid (IPG)
- Motivations
- Load Balancing and Partitioning
- Our Contributions
- The new MinEX Partitioner
- Experimental Study
- Performance Results
- Conclusions and Ongoing Research
3. The Information Power Grid (IPG)
- Harness the power of geographically separated resources
- Developed by NASA and other collaborative partners
- Utilize a distributed environment to solve large-scale computational problems
- Additional relevant applications identified by the I-Way experiment:
  - Remote access to large databases with high-end graphics facilities
  - Remote virtual reality access to instruments
  - Remote interactions with supercomputer simulations
4. Motivations
- Develop techniques to enhance the feasibility of running applications on the IPG
- Effective load balancer/partitioner for a distributed environment
- Allow for latency tolerance to overcome low bandwidths
- Predict application performance by simulation of the IPG
5. Load Balancing and Partitioning
- GOAL: Distribute workload evenly among processors
- Static load balancers
  - Balance load prior to execution
  - Examples: smart compilers, schedulers
- Dynamic load balancers
  - Balance as the application is processed
  - Examples: adaptive contracting, gradient, symmetric broadcast networks
- Semi-dynamic load balancers
  - Temporarily stop processing to balance workload
  - Utilize a partitioning technique
  - Examples: MeTiS, Jostle, PLUM
6. Our Contributions
- Limitations of existing partitioners:
  - Separate partitioning and data redistribution steps
  - Lack of latency tolerance
  - Balance loads with excessive communication and data movement
- Propose a new partitioner (MinEX) for the IPG environment:
  - Minimize total runtime rather than balancing workload
  - Compensate for high latency on the IPG
  - Compare with existing methods
7. The MinEX Partitioner
- Diffusive algorithm whose goal is to minimize total runtime
- User-supplied function for latency tolerance
- Accounts for data redistribution cost during partitioning
- Collapses pairs of vertices incrementally
- Partitions the contracted graph
- Refines the graph gradually back to the original, in reverse order
- Vertex reassignment is considered at each refinement step
8. Metrics Utilized
- Processing Weight: Wgt_v = PWgt_v x Proc_c
- Communication Cost: Comm_v = Sum over edges (v,w) of CWgt(v,w) x Connect(c_p, c_q)
- Redistribution Cost: Remap_v = RWgt_v x Connect(c_p, c_q), if p != q
- Weighted Queue Length: QWgt(p) = Sum over vertices v on p of (Wgt_v + Comm_v + Remap_v)
  - Heaviest load: MaxQWgt
  - Lightest load: MinQWgt
  - Average load: AvgQWgt
- Total system load: QWgtToT = Sum over p of QWgt(p)
- Load Imbalance Factor: LoadImb = MaxQWgt / AvgQWgt
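The metrics above can be sketched in code. This is a minimal illustration, not the MinEX implementation: the dictionary-based mesh representation and all parameter names are assumptions.

```python
# Illustrative sketch of the MinEX load metrics; data layout is hypothetical.
def qwgt(p, assign, pwgt, rwgt, cwgt, cluster, proc, connect, prev_assign):
    """Weighted queue length QWgt(p): the sum, over vertices v assigned to
    processor p, of processing weight + communication cost + remap cost."""
    total = 0.0
    c_p = cluster[p]
    for v, pv in assign.items():
        if pv != p:
            continue
        # Processing weight: PWgt_v scaled by the cluster's processing slowdown
        total += pwgt[v] * proc[c_p]
        # Communication cost over edges to vertices on other processors
        for w, weight in cwgt.get(v, {}).items():
            q = assign[w]
            if q != p:
                total += weight * connect[(c_p, cluster[q])]
        # Redistribution cost if v was moved from its previous processor
        if prev_assign[v] != p:
            total += rwgt[v] * connect[(cluster[prev_assign[v]], c_p)]
    return total

def load_imbalance(loads):
    """LoadImb = MaxQWgt / AvgQWgt."""
    return max(loads) / (sum(loads) / len(loads))
```

With two vertices on two processors and unit slowdowns, `qwgt` reduces to processing weight plus cut-edge weight, and `load_imbalance` compares the heaviest processor against the average.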
9. MinVar, Gain, and ThroTTle
- Processor workload variance from MinQWgt:
  - MinVar = Sum over p of (QWgt(p) - MinQWgt)^2
- dMinVar reflects the improvement in MinVar after a vertex reassignment
- Gain is the change (dQWgtToT) to total system load resulting from a vertex reassignment
- ThroTTle is a user-defined parameter
- Vertex moves that improve dMinVar are allowed if Gain/ThroTTle < dMinVar
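The acceptance rule above can be written as a small predicate. A hedged sketch, with names chosen for illustration rather than taken from the MinEX source:

```python
# Sketch of the ThroTTle acceptance test: a move that improves MinVar is
# accepted only when Gain / ThroTTle < dMinVar.
def minvar(loads, min_qwgt):
    """MinVar = sum over processors of (QWgt(p) - MinQWgt)^2."""
    return sum((q - min_qwgt) ** 2 for q in loads)

def move_allowed(d_minvar, gain, throttle):
    """Accept a reassignment only if it improves the variance (d_minvar > 0)
    and the throttled load increase stays below that improvement."""
    return d_minvar > 0 and gain / throttle < d_minvar
```

A larger ThroTTle thus tolerates more total-load growth per unit of variance improvement, which matches its role as a tuning knob in the analysis that follows.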
10. MinEX Data Structures
- Mesh: {V, E, vTot, VMap, VList, EList}
  - V: number of active vertices
  - E: total number of edges
  - vTot: total number of vertices
  - VMap: pointer to the list of active vertices
  - VList: pointer to the complete list of vertices
  - EList: pointer to the list of edges
- EList entries contain {w, CWgt(v,w)}
  - w: adjacent vertex
  - CWgt(v,w): edge communication weight
11. MinEX Data Structures (continued)
- VList (for each vertex v): {PWgt, RWgt, e, e, merge, lookup, VMap, heap, border}
  - PWgt: computational weight
  - RWgt: redistribution weight
  - e: number of incident edges
  - e: pointer to the first edge
  - merge: vertex that merged with v (or -1)
  - lookup: active vertex containing v (or -1)
  - VMap: pointer to v's position in VMap
  - heap: pointer to the heap entry for v
  - border: indicates whether v is a border vertex
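For concreteness, the two structure lists above can be rendered as Python dataclasses. Field names follow the slides; the types, defaults, and index-based "pointers" are assumptions made for this sketch:

```python
# Illustrative rendering of the MinEX structures; not the original C layout.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Vertex:             # one VList entry
    pwgt: float           # PWgt: computational weight
    rwgt: float           # RWgt: redistribution weight
    nedges: int           # number of incident edges
    first_edge: int       # index of the first edge in EList
    merge: int = -1       # vertex that merged with v, or -1
    lookup: int = -1      # active vertex containing v, or -1
    vmap: int = -1        # position of v in VMap
    heap: int = -1        # index of v's heap entry, or -1
    border: bool = False  # True if v has edges into other partitions

@dataclass
class Mesh:
    nactive: int                    # V: number of active vertices
    nedges: int                     # E: total number of edges
    vtot: int                       # vTot: total number of vertices
    vmap: List[int]                 # active-vertex list
    vlist: List[Vertex]             # complete vertex list
    elist: List[Tuple[int, float]]  # (w, CWgt(v, w)) entries
```

Storing each vertex's edge run as (first_edge, nedges) over a single EList array mirrors the compressed adjacency layout common to graph partitioners.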
12. MinEX Contraction Phase
- Form meta-vertices by collapsing edges
- Collapse the edge that maximizes CWgt(v,w) / (RWgt_v + RWgt_w)
- Procedure Find(v):
    If (merge = -1) Return v
    If (lookup != -1) And (lookup < vTot) Then Return lookup = Find(lookup)
    Else Return lookup = Find(merge)
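The Find procedure above resolves which active meta-vertex currently contains a collapsed vertex, caching the answer in lookup so later calls follow a shortened chain (as in union-find path compression). A runnable sketch under assumed data structures, with a minimal VNode standing in for a VList entry:

```python
# Illustrative Find: follow merge/lookup chains to the active meta-vertex.
from dataclasses import dataclass

@dataclass
class VNode:
    merge: int = -1    # vertex that merged with v, or -1
    lookup: int = -1   # cached active vertex containing v, or -1

def find(vlist, vtot, v):
    """Return the active vertex containing v, caching it in v's lookup."""
    node = vlist[v]
    if node.merge == -1:                  # v was never merged: it is active
        return v
    if node.lookup != -1 and node.lookup < vtot:
        node.lookup = find(vlist, vtot, node.lookup)  # refresh cached chain
    else:
        node.lookup = find(vlist, vtot, node.merge)   # first resolution
    return node.lookup
```

After one resolution, a vertex's lookup points directly at its active meta-vertex, so repeated queries during refinement stay cheap.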
13. MinEX Partition Phase
- The contracted graph allows efficient partitioning
- A heap with pointers is created:
  - For each vertex, compute the optimal reassignment
  - If the dMinVar, Gain, and ThroTTle criteria are satisfied, the vertex is added to the Gain min-heap
  - The VList heap pointer is set
- The heap is adjusted as vertices are reassigned
- The process stops when the heap becomes empty
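The partition-phase loop can be skeletonized as follows. This is a simplified, hypothetical sketch: the dMinVar/Gain/ThroTTle criteria are abstracted into an `accept` predicate, and the real partitioner also recomputes neighbors' best moves and adjusts the heap after each reassignment, which is omitted here.

```python
# Skeleton of a Gain-keyed min-heap pass over candidate vertex moves.
import heapq

def partition_pass(moves, accept):
    """moves: (gain, vertex, target_processor) candidates.
    Pops moves in increasing Gain order, applies those the acceptance
    criteria allow, and stops when the heap becomes empty."""
    heap = list(moves)
    heapq.heapify(heap)                 # min-heap: lowest Gain first
    applied = []
    while heap:                         # process stops when heap is empty
        gain, v, p = heapq.heappop(heap)
        if accept(gain, v, p):          # dMinVar/Gain/ThroTTle test goes here
            applied.append((v, p))
    return applied
```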
14. MinEX Refinement Phase
- Refinement proceeds in reverse order from contraction, popping vertex pairs off the stack
- Reassignment of each refined vertex is considered, and the partitioning process is restarted
- Vertex lookup and merge values are reset by following the merge chain when edges are accessed (if lookup > vTot)
15. Analysis of ThroTTle Values (P=32)
- [Figure: Expected MaxQWgt for varying ThroTTle values]
- [Figure: Expected LoadImb for varying ThroTTle values]
16. Latency Tolerance Approach
- Move data sets and edge data first
- Achieve latency tolerance by overlapping processing with communication
- Optimistic view: processing completely hides the latency
- Pessimistic view: no latency hiding occurs
- The application passes its latency-hiding function to MinEX:
  1. Send data sets to be moved
  2. Send edge data
  3. Process vertices not waiting for edge communication
  4. Receive and unpack remapped data sets
  5. Receive and unpack communication data
  6. Repeat steps 2-5 until all vertices are processed
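The six steps above can be sketched as a deterministic loop. This is an illustrative simulation, not the MinEX interface: network delivery is modeled by a per-message delay counter so the overlap of step 3 (processing) with outstanding communication is visible.

```python
# Simulated latency-hiding loop: process ready vertices while messages for
# the rest are still "in flight". All structures are illustrative.
def run_with_latency_hiding(vertices, needs_edge_data, delay):
    """vertices: vertex ids to process; needs_edge_data: ids awaiting a
    message; delay: simulated steps until each message arrives.
    Returns the order in which vertices were processed."""
    waiting = {v: delay for v in needs_edge_data}   # steps 1-2: data sent
    pending = list(vertices)
    processed = []
    while pending:
        # Step 3: process every vertex not waiting for edge communication
        ready = [v for v in pending if v not in waiting]
        processed.extend(ready)
        pending = [v for v in pending if v in waiting]
        # Steps 4-5: one time step elapses; arrived messages are unpacked
        waiting = {v: d - 1 for v, d in waiting.items() if d - 1 > 0}
        # Step 6: repeat until all vertices are processed
    return processed
```

In the optimistic case the ready set stays non-empty until all messages arrive, so communication delay adds nothing to the total; in the pessimistic case the loop idles while waiting, as the slides describe.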
17. Experimental Study: Simulation of an IPG Environment
- A configuration file defines clusters, processors, and interconnect slowdowns
- Processors in a cluster are assumed homogeneous
- Connect(c1, c2): interconnect slowdown between clusters c1 and c2 (unity for no slowdown)
- If c1 = c2, Connect(c1, c2) is the intraconnect slowdown
- Proc_c represents the processing slowdown (normalized to unity) within a cluster
- The configuration file is mapped to the processing graph by MinEX so that actual vertex assignments in the distributed environment can be modeled
18. Test Application: Unstructured Adaptive Mesh
- Time-dependent shock wave propagated through a cylindrical volume
- Tetrahedral mesh discretization
- Previously refined elements are coarsened
- The mesh grows from 50K to 1.8M tets over nine adaptation levels
- Workload becomes unbalanced as the mesh is adapted
19. Characteristics of the Test Application
- Mesh elements interact only with immediate
neighbors - High communication and remapping costs
- Numerical solver not included
20. MinEX Partitioner Performance
- SBN: dynamic load balancer based on a Symmetric Broadcast Network, adapted for mesh applications
- PLUM: semi-dynamic framework for processing adaptive, unstructured meshes
- MinEX is compared with SBN and PLUM
21. Experimental Results (P=32)
- [Figure: Expected runtimes (no latency tolerance) vs. interconnect slowdowns]
- [Figure: Expected runtimes (maximum latency tolerance) vs. interconnect slowdowns]
- Runtimes are in thousands of units
22. Conclusions and Ongoing Research
- Introduced a new partitioner, MinEX, and experimented in simulated IPG environments
- Runtimes increase with larger slowdowns as clusters are added
- Additional clusters increase the benefits of latency tolerance
- Estimated runtimes with MinEX improved by a factor of five over no partitioning
- Currently applying MinEX to the N-body problem (Barnes-Hut algorithm)