Title: More on Adaptivity in Grids
- Sathish S. Vadhiyar
- Source/Credits: Figures from the referenced papers
Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid
- Three general classes of divisible applications
  - Master-worker paradigm: 1 level
  - Hierarchical master-worker grid system: 2 levels
  - Divide-and-conquer paradigm: allows computation to be split up in a general way, e.g. search algorithms, ray tracing
- The work presents mechanisms for handling processors that leave
  - Handling partial results from leaving processors
  - Handling orphan work
- Two cases of processors leaving
  - When processors leave gracefully (e.g. when a processor reservation comes to an end)
  - When processors crash
- Restructuring the computation tree
Introduction
- Divide-and-conquer
  - Recursive subdivision: after the subproblems are solved, their results are recursively combined until the final solution is reached
- Work is distributed across processors by work-stealing
  - When a processor runs out of work, it picks another processor at random and steals a job from its work queue
  - After computing the job, the result is returned to the originating processor
- The authors have a work-stealing algorithm called CRS (cluster-aware random stealing) that overlaps intra-cluster steals with inter-cluster steals
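The random-stealing step described above can be sketched as follows. This is a minimal single-process illustration, not the authors' implementation: the `Processor` class, the `steal` function and the choice to steal the oldest job in the victim's queue are assumptions made for the example, and the cluster-aware overlap of CRS is not modelled.

```python
import random
from collections import deque


class Processor:
    """Hypothetical stand-in for one machine in the computation."""

    def __init__(self, pid):
        self.pid = pid
        self.queue = deque()  # local work queue of job ids
        self.stolen = {}      # job id -> pid of the thief that took it


def steal(thief, processors, rng):
    """Pick a random victim and move one job from its queue to the thief.

    Returns the stolen job id, or None if the victim had no work.
    Assumption: the thief takes the oldest job in the victim's queue.
    """
    victims = [p for p in processors if p is not thief]
    victim = rng.choice(victims)
    if not victim.queue:
        return None
    job = victim.queue.popleft()
    victim.stolen[job] = thief.pid  # remember who must return the result
    thief.queue.append(job)
    return job


rng = random.Random(0)
a, b = Processor(0), Processor(1)
a.queue.extend(["j1", "j2", "j3"])
job = steal(b, [a, b], rng)
```

The victim records the thief's id for every stolen job; that record is what later allows jobs to be restarted when a thief leaves.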
Malleability
- Adding a new machine to a divide-and-conquer computation is simple
  - The new machine starts stealing jobs from other machines
- When a processor leaves, the computation tree is restructured to reuse as many partial results as possible
- What happens when processors leave
  - The remaining processors are notified by the leaving processor (when processors leave gracefully)
  - The leave is detected by the communication layer (in unexpected leaves)
Recomputing jobs stolen by leaving processors
- Each processor maintains a list of jobs stolen from it and the processor IDs of the thieves
- When processors leave
  - Each of the remaining processors traverses its stolen jobs list and searches for jobs stolen by leaving processors
  - Such jobs are put back in the work queues of their owners, marked as restarted
  - Children of restarted jobs are also marked as restarted when they are spawned
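The restart procedure above can be sketched as follows. The function names and the plain dict/list/set data structures are assumptions chosen for illustration; the source only states that each processor keeps a list of stolen jobs with the thieves' IDs.

```python
def restart_stolen_jobs(stolen, work_queue, restarted, leaving):
    """Requeue every job that was stolen by a processor in `leaving`.

    stolen:     dict job id -> thief processor id (this owner's stolen-jobs list)
    work_queue: list of job ids waiting on this owner
    restarted:  set of job ids marked as restarted
    leaving:    set of processor ids that have left
    """
    for job_id, thief in list(stolen.items()):
        if thief in leaving:
            del stolen[job_id]
            work_queue.append(job_id)
            restarted.add(job_id)


def spawn_child(parent_id, child_id, work_queue, restarted):
    """Spawn a child job; it inherits the restarted mark from its parent."""
    work_queue.append(child_id)
    if parent_id in restarted:
        restarted.add(child_id)


stolen = {"j2": 5, "j3": 6}
work_queue, restarted = [], set()
restart_stolen_jobs(stolen, work_queue, restarted, leaving={5})
spawn_child("j2", "j2.1", work_queue, restarted)
```

In this run only `j2` is requeued, because processor 5 is the one that left; its child `j2.1` inherits the restarted mark when it is spawned.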
Example
Example (Contd)
Orphan Jobs
- Orphan jobs: jobs stolen from leaving processors
- Existing approaches
  - A processor working on an orphan job must discard the result, since it does not know where to return it
- The processor needs to know the new address to which the result should be returned
- Salvaging orphan jobs requires creating the link between the orphan and its restarted parent
Orphan Jobs (Contd)
- For each finished orphan job
  - Broadcast a small message containing the jobID of the orphan and the processorID of the processor that computed it
  - Abort unfinished intermediate nodes of orphan subtrees
- The (jobID, processorID) tuple is stored by each processor in a local orphan table
Orphan Jobs (Contd)
- When a processor tries to recompute a restarted job
  - The processor performs a lookup in its orphan table
  - If the jobIDs match, the processor removes the job from its work queue and puts it in its list of stolen jobs
  - It sends a message to the orphan owner requesting the result of the job
  - The orphan owner marks the job as stolen from the sender of the request
- The link between the restarted parent and the orphaned child is restored
- Reusing orphans improves the performance of the system
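The orphan-table lookup described above can be sketched as follows. The `Node` class and its method names are hypothetical, and message sending is modelled as appending to a local `requests` list rather than real communication.

```python
class Node:
    """Hypothetical processor holding a work queue and an orphan table."""

    def __init__(self, pid):
        self.pid = pid
        self.work_queue = []    # job ids waiting to be computed
        self.stolen = {}        # job id -> pid expected to return the result
        self.orphan_table = {}  # job id -> pid holding a finished orphan result
        self.requests = []      # (target pid, job id) result requests "sent"

    def on_orphan_broadcast(self, job_id, owner_pid):
        """Store a broadcast (jobID, processorID) tuple locally."""
        self.orphan_table[job_id] = owner_pid

    def next_job(self):
        """Return the next job to compute, skipping jobs already finished as orphans."""
        while self.work_queue:
            job_id = self.work_queue.pop()
            owner = self.orphan_table.get(job_id)
            if owner is None:
                return job_id  # no orphan result known: compute it locally
            # Orphan found: treat the job as stolen by the orphan owner
            # and ask that owner for the result instead of recomputing.
            self.stolen[job_id] = owner
            self.requests.append((owner, job_id))
        return None


node = Node(0)
node.work_queue = ["a", "b"]
node.on_orphan_broadcast("b", 7)
job = node.next_job()
```

Here job `b` is never recomputed: the node records it as stolen by processor 7 and requests the result, then moves on to job `a`.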
Example
Partial Results on Leaving Processors
- If a processor knows it has to leave
  - It chooses another processor randomly
  - It transfers all results of finished jobs to the other processor
  - The jobs are treated as orphan jobs
  - The processor receiving the finished jobs broadcasts a (jobID, processorID) tuple
  - The partial results are linked to the restarted parents
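The graceful-leave handoff can be sketched as follows. The function name, the dict-based result stores and the returned list of announcements are assumptions for illustration; real result transfer and broadcasting are not modelled.

```python
import random


def leave_gracefully(finished, peers, rng):
    """Hand the finished results of a leaving processor to a random peer.

    finished: dict job id -> result held by the leaving processor
    peers:    dict pid -> dict of results stored on each remaining processor
    Returns the receiver's pid and the (jobID, processorID) tuples
    that the receiver would broadcast.
    """
    receiver = rng.choice(sorted(peers))
    peers[receiver].update(finished)  # transfer all finished results
    announcements = [(job_id, receiver) for job_id in finished]
    return receiver, announcements


rng = random.Random(1)
peers = {1: {}, 2: {}}
finished = {"j4": 10, "j5": 20}
receiver, announcements = leave_gracefully(finished, peers, rng)
```

The announcements carry the receiver's ID, not the leaver's, so restarted parents later request the results from a processor that is still present.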
Special Cases
- Master leaving: a special case, because the master owns the root job, which was not stolen from anyone
  - The remaining processors elect a new master, which respawns the root job
  - The new run reuses partial results of orphan jobs from the previous run
- Adding processors
  - A new processor downloads an orphan table from one of the other processors
  - Orphan table requests are piggybacked on steal requests
- Message combining
  - One small (broadcast) message has to be sent for each orphan and for each computed job in the leaving processor
  - These messages are combined
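One way to picture message combining is to pack many (jobID, processorID) tuples into a single broadcast payload. The source does not say how the messages are combined, so the grouping by processorID below is purely an assumption, as are the function names.

```python
def combine_announcements(pairs):
    """Pack many (jobID, processorID) tuples into one payload.

    Assumption: tuples are grouped by processorID so that each ID
    appears only once in the combined message.
    """
    combined = {}
    for job_id, pid in pairs:
        combined.setdefault(pid, []).append(job_id)
    return combined


def apply_combined(orphan_table, combined):
    """Unpack one combined message into a local orphan table."""
    for pid, job_ids in combined.items():
        for job_id in job_ids:
            orphan_table[job_id] = pid


pairs = [("j4", 2), ("j5", 2), ("j9", 3)]
message = combine_announcements(pairs)
table = {}
apply_combined(table, message)
```

Three separate broadcasts become one message, and every receiver ends up with the same orphan table entries it would have had from the individual messages.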
Results
- Three types of experiments
  - Overhead when no processors are leaving
  - Comparison with the traditional approach that does not save orphans
  - Showing that the mechanism can be used for efficient migration of the computation
- Testbeds
  - DAS-2 system: 5 clusters at five Dutch universities
  - European GridLab: 24 processors at 4 sites in Europe
    - 8 in Leiden and 8 in Delft (DAS-2)
    - 4 in Berlin
    - 4 in Brno
Overhead during Normal Execution
- 4 applications run on a system with and without the authors' mechanisms
  - RayTracer, TSP, SAT solver, Knapsack problem
- Overhead is negligible
Impact of Salvaging Partial Results
- RayTracer application
- 2 DAS-2 clusters with 16 processors each
- One cluster was removed in the middle of the computation, after half of the time the run would take on 2 clusters without processors leaving
- Comparison of
  - Traditional approach (without saving partial results)
  - Recomputing trees when processors leave unexpectedly
  - Recomputing trees when processors leave gracefully
  - Runtime on 1.5 clusters (16 processors in one cluster and 8 processors in another cluster)
- The difference between the last two gives the overhead of transferring the partial results from leaving processors and the work lost because of the leaving processors
Results
Migration
- Replaced one cluster with another
- RayTracer application on 3 clusters
- In the middle of the computation, one cluster was gracefully removed and another identical cluster added
- Comparison with a run without migration
- Overhead of migration 2
References
- Jon B. Weissman. Predicting the cost and benefit of adapting data parallel applications in clusters. Journal of Parallel and Distributed Computing, Volume 62, Issue 8 (August 2002), pages 1248-1271.
- "Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid," Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, pp. 13a-13a, 04-08 April 2005.
Predicting the Cost and Benefit of Adapting Data Parallel Applications in Clusters
- Jon Weissman
- Library of adaptation techniques
  - Migration
    - Involves remote process creation followed by transmission of the old worker's data to the new worker
  - Dynamic load balancing
    - Collecting load indices, determining redistribution and initiating data transmission
  - Addition or removal of processors
    - Followed by data transmission to maintain load balance
- Library calls to detect and initiate adaptation actions within the applications
- Adaptation event sent from an external detector to all workers
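The "determining redistribution" step of dynamic load balancing can be illustrated with a small sketch. The source does not give the redistribution rule, so the one used here (shares proportional to 1 / (1 + load index)) and the function name `redistribute` are assumptions made only for the example.

```python
def redistribute(total_items, loads):
    """Split `total_items` data items across workers.

    Assumption: each worker's share is proportional to 1 / (1 + load),
    where `loads` holds one collected load index per worker.
    """
    weights = [1.0 / (1.0 + load) for load in loads]
    scale = total_items / sum(weights)
    shares = [int(w * scale) for w in weights]
    # Items lost to rounding go to the least-loaded workers, one each.
    leftover = total_items - sum(shares)
    order = sorted(range(len(loads)), key=lambda i: loads[i])
    for i in order[:leftover]:
        shares[i] += 1
    return shares


uneven = redistribute(300, [0, 1])    # an idle worker and a loaded one
even = redistribute(10, [0, 0, 0])    # three idle workers
```

Comparing the new shares with the current data distribution tells each worker how many items it would need to send or receive in the data transmission step.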