More on Adaptivity in Grids - PowerPoint PPT Presentation

1
More on Adaptivity in Grids
  • Sathish S. Vadhiyar
  • Source/Credits Figures from the referenced papers

2
Fault-Tolerance, Malleability and Migration for
Divide-and-Conquer Applications on the Grid
  • Wrzesinska et al.

3
Fault-Tolerance, Malleability and Migration for
Divide-and-Conquer Applications on the Grid
  • 3 general classes of divisible applications
  • Master-worker paradigm: 1 level
  • Hierarchical master-worker grid system: 2 levels
  • Divide-and-conquer paradigm: allows computation
    to be split up in a general way, e.g. search
    algorithms, ray tracing, etc.
  • The work presents mechanisms for handling
    processors that leave
  • Handling partial results from leaving processors
  • Handling orphan work
  • 2 cases of processors leaving
  • When processors leave gracefully (e.g. when
    processor reservation comes to an end)
  • When processors crash
  • Restructuring computation tree

4
Introduction
  • Divide-and-conquer
  • Recursive subdivision: after the subproblems are
    solved, their results are recursively combined
    until the final solution is reached
  • Work is distributed across processors by
    work-stealing
  • When a processor runs out of work, it picks
    another processor at random and steals a job
    from its work queue
  • After computing the job, the result is returned
    to the originating processor
  • Uses a work-stealing algorithm called CRS
    (Cluster-aware random stealing) that overlaps
    intra-cluster steals with inter-cluster steals
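
A minimal sketch of the random stealing step described above. It shows plain random stealing only, not CRS, and makes no attempt to model clusters. The `Processor` class, the `pick_victim` helper, and the choice to steal the oldest job in the victim's queue are assumptions made for illustration; they are not taken from the paper.

```python
import random
from collections import deque


class Processor:
    """A worker with a local work queue (hypothetical model)."""

    def __init__(self, pid):
        self.pid = pid
        self.queue = deque()

    def steal_from(self, victim):
        """Take the oldest job from the victim's queue, if there is one."""
        if victim.queue:
            return victim.queue.popleft()
        return None


def pick_victim(thief, processors, rng):
    """Choose a steal victim uniformly at random among the other processors."""
    others = [p for p in processors if p is not thief]
    return rng.choice(others)


rng = random.Random(0)
procs = [Processor(i) for i in range(4)]
procs[0].queue.extend(["job-a", "job-b", "job-c"])

# Processor 3 is idle: it keeps picking random victims until a steal succeeds.
thief = procs[3]
stolen = None
while stolen is None:
    stolen = thief.steal_from(pick_victim(thief, procs, rng))
```

In a real system the thief would compute the stolen job and send the result back to the processor it was stolen from.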

5
Malleability
  • Adding a new machine to a divide-and-conquer
    computation is simple
  • New machine starts stealing jobs from other
    machines
  • When a processor leaves: the computation tree is
    restructured to reuse as many partial results
    as possible
  • What happens when processors leave:
  • remaining processors are notified by the leaving
    processor (when processors leave gracefully)
  • the leave is detected by the communication layer
    (when processors leave unexpectedly)

6
Recomputing jobs stolen by leaving processors
  • Each processor maintains a list of jobs stolen
    from it and the processor IDs of the thieves
  • When processors leave
  • Each of the remaining processors traverses its
    stolen jobs list, searches for jobs stolen by
    leaving processors
  • Such jobs are put back in the work queues of
    owners, marked as restarted
  • Children of restarted jobs are also marked as
    restarted when they are spawned
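
The bookkeeping on this slide can be sketched as follows. This is an illustrative model under stated assumptions, not the paper's implementation: the `Job` and `Processor` classes and the method names `record_steal`, `handle_leave`, and `spawn_child` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    restarted: bool = False


class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.work_queue = deque()
        self.stolen = []  # list of (job, thief_pid) pairs

    def record_steal(self, job, thief_pid):
        """Remember which processor stole which job."""
        self.stolen.append((job, thief_pid))

    def handle_leave(self, leaving_pids):
        """Requeue every job that was stolen by a processor that has left."""
        kept = []
        for job, thief_pid in self.stolen:
            if thief_pid in leaving_pids:
                job.restarted = True
                self.work_queue.append(job)
            else:
                kept.append((job, thief_pid))
        self.stolen = kept

    def spawn_child(self, parent, child_id):
        """Children inherit the restarted mark from their parent."""
        child = Job(child_id, restarted=parent.restarted)
        self.work_queue.append(child)
        return child


owner = Processor(0)
owner.record_steal(Job("j1"), thief_pid=1)
owner.record_steal(Job("j2"), thief_pid=2)
owner.handle_leave({2})                      # processor 2 has left
restarted = owner.work_queue.popleft()
child = owner.spawn_child(restarted, "j2.0")
```

Propagating the mark to children matters because only restarted jobs may have a matching orphan somewhere in the system.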

7
Example

8
Example (Contd)

9
Orphan Jobs
  • Jobs stolen from leaving processors
  • Existing approaches
  • Processor working on an orphan job must discard
    the result, since it does not know where to
    return the result
  • Need to know the new address to return the result
  • Salvaging orphan jobs requires creating the link
    between the orphan and its restarted parent

10
Orphan Jobs (Contd)
  • For each finished orphan job
  • Broadcast of a small message containing the jobID
    of the orphan and the processorID that computed
    the orphan
  • Abort unfinished intermediate nodes of orphan
    subtrees
  • (jobID, processorID) stored by each processor in
    a local orphan table

11
Orphan Jobs (Contd)
  • When a processor tries to recompute restarted
    jobs
  • Processors perform lookup in orphan table
  • If the jobIDs match, the processor removes the job
    from the work queue and puts it in the list of
    stolen jobs
  • Send message to the orphan owner requesting
    result of the job
  • Orphan owner marks it as stolen from the sender
    of the request
  • Link between restarted parent and orphaned child
    is restored
  • Reusing orphans improves performance of the system
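
The orphan-table lookup from the last two slides can be sketched from the point of view of the processor that holds a restarted job. Everything named here (`Processor`, `on_orphan_broadcast`, `next_job`) is hypothetical, and the request message to the orphan owner is only indicated by a comment.

```python
class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.work_queue = []      # job IDs waiting to run
        self.stolen = {}          # job_id -> pid expected to return the result
        self.orphan_table = {}    # job_id -> pid holding a finished orphan

    def on_orphan_broadcast(self, job_id, holder_pid):
        """Record a (jobID, processorID) tuple received by broadcast."""
        self.orphan_table[job_id] = holder_pid

    def next_job(self):
        """Return the next job to compute, skipping jobs whose result exists."""
        while self.work_queue:
            job_id = self.work_queue.pop(0)
            holder = self.orphan_table.get(job_id)
            if holder is None:
                return job_id  # no orphan result known: compute it locally
            # Treat the job as stolen by the orphan holder; a real system
            # would also send a result request to `holder` at this point.
            self.stolen[job_id] = holder
        return None


p = Processor(0)
p.work_queue = ["j7", "j8"]                 # restarted jobs
p.on_orphan_broadcast("j7", holder_pid=3)   # j7 was already finished elsewhere
job = p.next_job()
```

Here `j7` is never recomputed: it moves to the stolen list as if processor 3 had stolen it, which restores the parent-child link.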

12
Example

13
Partial Results on Leaving Processors
  • If a processor knows it has to leave
  • Chooses another processor randomly
  • Transfers all results of finished jobs to the
    other processor
  • The jobs are treated as orphan jobs
  • Processor receiving the finished jobs broadcasts
    a (jobID, processorID) tuple
  • Partial results linked to the restarted parents

14
Special Cases
  • Master leaving is a special case: it owns the root
    job, which was not stolen from anyone
  • Remaining processors elect the new master which
    will respawn the root job
  • New run will reuse partial results of orphan jobs
    from previous run
  • Adding processors
  • New processor downloads an orphan table from one
    of the other processors
  • Piggybacks orphan table requests with steal
    requests
  • Message combining
  • One small (broadcast) message has to be sent for
    each orphan and for each computed job in the
    leaving processor
  • Messages are combined
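
One possible way to combine the per-job announcements is simple batching, sketched below. The slides do not say how messages are combined, so the `combine` function and the `max_batch` limit are assumptions for illustration only.

```python
def combine(announcements, max_batch):
    """Group (job_id, pid) tuples into batches of at most max_batch entries."""
    return [
        announcements[i:i + max_batch]
        for i in range(0, len(announcements), max_batch)
    ]


# Five announcements from processor 2, sent as three messages instead of five.
pending = [("j1", 2), ("j2", 2), ("j3", 2), ("j4", 2), ("j5", 2)]
messages = combine(pending, max_batch=2)
```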

15
Results
  • 3 types of experiments
  • Overhead when no processors are leaving
  • Comparison with traditional approach that does
    not save orphans
  • To show that mechanism can be used for efficient
    migration of the computation
  • Testbeds
  • DAS-2 system: 5 clusters at five Dutch
    universities
  • European GridLab: 24 processors at 4 sites in
    Europe
  • 8 in Leiden and 8 in Delft (DAS-2)
  • 4 in Berlin
  • 4 in Brno

16
Overhead during normal Execution
  • 4 applications run on a system with and without
    the proposed mechanisms
  • RayTracer, TSP, SAT solver, Knapsack problem
  • Overhead is negligible

17
Impact of Salvaging Partial Results
  • RayTracer Application
  • 2 DAS-2 clusters with 16 processors each
  • Removed one cluster in the middle of the
    computation, after half of the time it would take
    on 2 clusters without processors leaving
  • Comparison of
  • Traditional approach (without saving partial
    results)
  • Recomputing trees when processors leave
    unexpectedly
  • Recomputing trees when processors leave
    gracefully
  • Runtime on 1.5 clusters (16 processors in one
    cluster and 8 processors in another cluster)
  • Difference between the last two gives the overhead
    of transferring the partial results from leaving
    processors and the work lost because of the
    leaving processors

18
Results

19
Migration
  • Replaced one cluster with another
  • Raytracer application on 3 clusters
  • In the middle of the computation, one cluster was
    gracefully removed, and another identical cluster
    added
  • Comparison without migration
  • Overhead of migration 2

20
References
  • Jon B. Weissman. Predicting the cost and benefit
    of adapting data parallel applications in
    clusters. Journal of Parallel and Distributed
    Computing, Volume 62, Issue 8, pages 1248-1271,
    August 2002.
  • Wrzesinska et al. Fault-Tolerance, Malleability
    and Migration for Divide-and-Conquer Applications
    on the Grid. Proceedings of the 19th IEEE
    International Parallel and Distributed Processing
    Symposium, pp. 13a-13a, 04-08 April 2005.

21
Predicting the Cost and Benefit of Adapting Data
Parallel Applications in Clusters (Jon Weissman)
  • Library of adaptation techniques
  • Migration
  • Involves remote process creation followed by
    transmission of the old worker's data to the new
    worker
  • Dynamic load balancing
  • Collecting load indices, determining
    redistribution and initiating data transmission
  • Addition or removal of processors
  • Followed by data transmission to maintain load
    balance
  • Library calls to detect and initiate adaptation
    actions within the applications
  • Adaptation event sent from an external detector
    to all workers