Title: Checkpointing Mechanism for the Grid Environment
1Checkpointing Mechanism for the Grid Environment
- K Sajadah, G Terstyanszky, S Winter, P. Kacsuk
- University of Westminster
2The Grid Environment
- Nature of Grid Environment
- Generic, heterogeneous, and dynamic, with many
unreliable resources that leave it exposed to
failures.
- Solution
- Fault-tolerant mechanisms should ensure
successful execution of applications.
3Fault Tolerant Solutions
- Retrying
- When a job fails, it is re-executed a certain
number of times.
- The expected job completion time can be very long.
- Replication
- Replicas of a job are executed on different Grid
resources simultaneously.
- It requires extra processing power.
- Checkpointing
- It stores a snapshot of an application's state and
uses it to restart execution in case of
failure.
- It is very efficient in environments where the failure
rate is high.
4Checkpointing
- Transparent Checkpointing
- Programmer orchestrates the checkpointing process
- Message synchronisation is performed.
- Checkpointing and recovery process is transparent
to the programmer.
- Non-Transparent Checkpointing
- Mechanism provides support for checkpointing
through run-time libraries.
- Programmer can specify data that should be
included in checkpoint file.
- Approach is not transparent to the programmer.
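A minimal sketch of the non-transparent approach, assuming a Python application that decides for itself which variables go into the checkpoint file. The names `save_checkpoint`, `load_checkpoint`, and `run`, and the use of `pickle` as the file format, are hypothetical choices for illustration and are not taken from any run-time library mentioned in these slides.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the programmer-selected state to `path` atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so that a crash in the middle of
    # the write cannot corrupt the previous checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every):
    """Toy computation: sum the step indices, checkpointing periodically."""
    # Only the data named here is saved; everything else is recomputed.
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If `run` is interrupted and called again with the same path, it resumes from the last saved step instead of from step 0.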
5Challenges in Checkpointing
- When to take the checkpoint
- How to synchronise (or how to minimise
inter-process communication)
- What kind of info to store at the checkpoint
- Where to store the checkpoint info
- How to restore the execution after a fault
6Checkpointing (2)
- Performance constraints in existing solutions
- Overheads due to synchronisation of messages.
- Checkpoint intervals are either user-defined with
no regular pattern or are periodic.
- Proposed solution
- Take checkpoints at the best possible pre-defined
intervals.
- Minimise (or optimise) inter-process communication
as much as possible.
7Checkpointing (3)
- Inter-process communications can cause
inconsistent checkpoints due to lost messages or
orphan messages.
- To achieve a globally consistent checkpoint,
synchronization should be performed.
- Synchronization introduces extra communications
among processes.
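The two failure modes can be made concrete with a small sketch. It assumes each message is described by its send time on the sender's local clock and its receive time on the receiver's local clock, and that each process records exactly one checkpoint time. The names `Message` and `classify` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_time: float  # measured on the sender's local clock
    recv_time: float  # measured on the receiver's local clock


def classify(messages, checkpoint_time):
    """Split messages that cross the checkpoint line into lost and orphan.

    checkpoint_time maps a process name to the local time of its checkpoint.
    """
    lost, orphan = [], []
    for m in messages:
        sent_before = m.send_time <= checkpoint_time[m.sender]
        received_before = m.recv_time <= checkpoint_time[m.receiver]
        if sent_before and not received_before:
            # The send is recorded but the receive is not: after a
            # restart the message would never be delivered.
            lost.append(m)
        elif received_before and not sent_before:
            # The receive is recorded but the send is not: the receiver's
            # saved state reflects a message the sender's state never sent.
            orphan.append(m)
    return lost, orphan
```

Under these assumptions, a set of checkpoints is consistent when both returned lists are empty.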
8Approaches Used
- Combination of
- First Order Approximation.
- Natural Synchronisation Points.
- First Order Approximation
- Calculate the optimal checkpointing intervals.
- Based on the Poisson process.
- Occurrence of failure is random with failure rate
λ.
9First Order Approximation
- The optimal checkpoint interval Tc is
- $T_c = \sqrt{2 T_s T_f}$, where
- Ts is the time required to save information at a
checkpoint.
- Tf is the mean time between failures, with
$T_f = T_h / \lambda_k$.
- The following data are needed
- The number of hours the program will run on the
machines (Th).
- The known failure rate during that time (λk).
- The time required to save information at a
checkpoint (Ts).
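The two formulas above can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, and it assumes all times are expressed in the same unit.

```python
import math


def mean_time_between_failures(run_hours, failures):
    """Tf = Th / lambda_k."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return run_hours / failures


def optimal_checkpoint_interval(save_time, mtbf):
    """Tc = sqrt(2 * Ts * Tf); both arguments use the same time unit."""
    if save_time <= 0 or mtbf <= 0:
        raise ValueError("save_time and mtbf must be positive")
    return math.sqrt(2.0 * save_time * mtbf)
```

As a worked example with made-up numbers (not measurements from this work): for Th = 100 hours and λk = 5, Tf = 20 hours; with Ts = 0.1 hours, Tc = sqrt(2 × 0.1 × 20) = 2 hours.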
10First Order Approximation (2)
[Figure legend: Tc = checkpoint interval; Ts = time to save a checkpoint; tr = rerun time of a failed application]
11First Order Approximation(3)
- Using the PROVE toolset, we can measure both the
execution time and the checkpointing time of an
application.
- Nagios can be used to determine the failure rate
of Grid resources.
12Natural Synchronisation Points
- Examples of natural synchronization points
- Barriers.
- Top or bottom of a main loop.
- Collective operations (broadcast, gather,
scatter, etc.)
- No interprocess communication at these points.
- Therefore, no need to be concerned with the state
of the communication channels or possible
in-transit messages.
- Eliminates the overhead incurred by the
synchronization process during
checkpointing.
13Natural Synchronisation Points (2)
[Figure: application execution with interacting processes]
[Figure: coordinated checkpoint, waiting for in-transit messages]
14Natural Synchronisation Points (4)
[Figure: coordinated checkpoint, logging in-transit messages]
[Figure: checkpointing at natural synchronisation points]
15New Checkpointing Approach
- Using First Order Approximation only
- Involves synchronisation of messages and
capturing in-transit messages.
- Checkpointing at natural synchronisation points
only
- May not be very effective because there is no
pattern in their occurrence.
16New Checkpointing Approach(2)
- Use a combination of both the Natural
Synchronisation Points and the First Order
Approximation.
- Take checkpoints at the natural synchronization
points that are closest to the optimal
checkpoint intervals.
17Choosing Checkpoint Intervals
[Figure: choosing appropriate checkpointing intervals]
18Choosing Checkpoint Intervals(2)
- Decision to select a checkpoint based on
- Optimal checkpoint interval,
- Natural synchronisation points and
- Critical Region.
- Checkpointing process is triggered by signals
sent to the coordinated process whenever
synchronization points are encountered.
19The Checkpointing Process
- When the coordinated process receives a signal, it
checks whether the signal falls within the
critical region.
- If so, a checkpoint is taken and the clock is
reset.
- If not, no checkpointing is performed.
- If no natural synchronization points are met
within the critical region, a checkpoint is forced
at the end of the critical region.
- In such cases, the checkpointing mechanism
performs synchronization to ensure there are no
lost or orphan messages.
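The decision procedure above can be sketched as an offline simulation. This is an illustration under stated assumptions, not the authors' implementation: the function name `plan_checkpoints` is hypothetical, the times of the natural synchronisation points are assumed to be known in advance, and the critical region is assumed to be symmetric, spanning `optimal_interval - margin` to `optimal_interval + margin` after the previous checkpoint.

```python
def plan_checkpoints(sync_points, optimal_interval, margin, end_time):
    """Return a list of (time, kind) pairs, kind is "natural" or "forced".

    sync_points: ascending times of natural synchronisation points.
    The clock is reset at every checkpoint, so each critical region is
    measured from the time of the previous checkpoint (assumption).
    """
    if not 0 <= margin < optimal_interval:
        raise ValueError("margin must satisfy 0 <= margin < optimal_interval")
    checkpoints = []
    last = 0.0  # time of the previous checkpoint (clock reset point)
    i = 0
    while True:
        region_start = last + optimal_interval - margin
        region_end = last + optimal_interval + margin
        # Signals that arrive before the critical region are ignored.
        while i < len(sync_points) and sync_points[i] < region_start:
            i += 1
        if i < len(sync_points) and sync_points[i] <= region_end:
            # First natural synchronisation point inside the region.
            t, kind = sync_points[i], "natural"
            i += 1
        else:
            # No signal in the region: force a checkpoint at its end.
            # A real mechanism would synchronise messages here.
            t, kind = region_end, "forced"
        if t > end_time:
            break
        checkpoints.append((t, kind))
        last = t
    return checkpoints
```

With made-up synchronisation points at 3, 7, 9, 16, and 30 minutes, an optimal interval of 8, a margin of 2, and a 40-minute run, the plan takes natural checkpoints at 7 and 16 and forced ones at 26 and 36.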
20The TestBed
- MadCity traffic simulation tool was used.
- Simulates traffic on a road network and shows how
individual vehicles behave on roads and at
junctions.
- MadCity traffic simulator can be parallelised
using PGRADE.
21The Testbed(2)
[Figure: proposed checkpointing solution]
22The Testbed(3)
- Through the First Order Approximation, the
calculated optimal checkpoint interval was 8
minutes.
- A critical region extending 2 minutes from the
optimal checkpoint interval was defined.
- Checkpoints taken at Ns1, Ns2, Ns5, Fs1, Ns6, Ns9.
- Overall average time between checkpoints: 8.2
minutes.
23Conclusion
- Proposed checkpointing mechanism provides a
more efficient way to save checkpoint
images.
- Minimises the need to synchronise
messages.
- Ensures that the average checkpointing interval is
close to the optimal checkpointing interval
defined by the First Order Approximation.
24Future Work
- Integrate the checkpointing solution into PGRADE to
provide an efficient fault-tolerant solution for
applications executed as Grid workflows.
- Provide an efficient and reliable storage
mechanism.
25Questions