1
Checkpointing Mechanism for the Grid Environment

K Sajadah, G Terstyanszky, S Winter, P. Kacsuk
University of Westminster

2
The Grid Environment

Nature of Grid Environment
Generic, heterogeneous, and dynamic with lots of
unreliable resources making it exposed to
failures.
Solution
Fault tolerant mechanisms should ensure
successful execution of applications.

3
Fault Tolerant Solutions

Retrying
When a job fails, it is re-executed a certain
number of times.
The expected job completion time can become very
long.
Replication
Replicas of a job are executed on different Grid
resources simultaneously.
It requires extra processing power.
Checkpointing
It stores a snapshot of the application state and
uses it to restart the execution in case of
failure.
It is very efficient in environments where the
failure rate is high.

4
Checkpointing

Transparent Checkpointing
The system, not the programmer, orchestrates the
checkpointing process.
Message synchronisation is performed.
The checkpointing and recovery process is
transparent to the programmer.
Non-Transparent Checkpointing
Mechanism provides support for checkpointing
through run-time libraries.
Programmer can specify data that should be
included in checkpoint file.
Approach is not transparent to the programmer.

5
Challenges in Checkpointing

When to take the checkpoint
How to synchronise (or how to minimise
inter-process communication)
What kind of info to store at the checkpoint
Where to store the checkpoints info
How to restore the execution after a fault

6
Checkpointing (2)

Performance constraints in existing solutions
Overheads due to synchronisation of messages.
Checkpoint intervals are either user-defined with
no regular pattern or are periodic.
Proposed solution
Take checkpoints at the best possible pre-defined
intervals.
Minimise (or optimise) inter-process
communication as much as possible.

7
Checkpointing (3)

Inter-process communications can cause
inconsistent checkpoints due to lost messages or
orphan messages.
To achieve a globally consistent checkpoint,
synchronization must be performed.
Synchronization introduces extra communications
among processes.

8
Approaches Used

Combination of
First Order Approximation.
Natural Synchronisation Points.
First Order Approximation
Calculate the optimal checkpointing intervals.
Based on the Poisson process.
Occurrence of failure is random with failure rate
λ.
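
A minimal sketch of what the Poisson assumption implies: with a constant failure rate, the time to the next failure is exponentially distributed, so the probability of at least one failure in an interval is 1 - exp(-rate * interval). The function name and the sample values are illustrative and do not come from the presentation.

```python
import math


def prob_failure_within(failure_rate: float, interval: float) -> float:
    """Probability of at least one failure within `interval`.

    Assumes failures follow a Poisson process with constant
    `failure_rate`, expressed per the same time unit as `interval`.
    """
    if failure_rate < 0 or interval < 0:
        raise ValueError("failure_rate and interval must be non-negative")
    return 1.0 - math.exp(-failure_rate * interval)


if __name__ == "__main__":
    # Hypothetical values: 0.05 failures per hour, 8-hour run.
    print(prob_failure_within(0.05, 8.0))
```

The longer a job runs without saving its state, the more likely it is to lose work, which is what motivates choosing the checkpoint interval carefully.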

9
First Order Approximation

The optimal checkpoint interval Tc is
$T_c = \sqrt{2 T_s T_f}$, where
Ts is the time required to save information at a
checkpoint.
Tf is the mean time between failures,
$T_f = T_h / \lambda_k$.
The following data are needed
The number of hours the program will run on the
machines (Th).
The known failure rate during that time
($\lambda_k$).
The time required to save information at a
checkpoint (Ts).
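
The calculation on this slide can be sketched in a few lines. This is an illustration only: the function names and the sample numbers are assumptions, and the failure figure is treated as the number of failures observed during the run time Th, so that Tf comes out as a time. All times must use the same unit.

```python
import math


def mean_time_between_failures(run_time: float, failures: float) -> float:
    """Tf = Th / (failures observed during Th)."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return run_time / failures


def optimal_checkpoint_interval(save_time: float, mtbf: float) -> float:
    """First order approximation: Tc = sqrt(2 * Ts * Tf)."""
    if save_time < 0 or mtbf < 0:
        raise ValueError("save_time and mtbf must be non-negative")
    return math.sqrt(2.0 * save_time * mtbf)


if __name__ == "__main__":
    # Hypothetical inputs: a 100-hour run with 5 failures,
    # and 0.1 hours (6 minutes) to save one checkpoint.
    tf = mean_time_between_failures(100.0, 5.0)   # 20 hours
    tc = optimal_checkpoint_interval(0.1, tf)     # about 2 hours
    print(f"Tf = {tf} h, Tc = {tc:.2f} h")
```

A longer save time or a longer mean time between failures both push the interval up, but only with the square root, so the interval is fairly insensitive to measurement error in either input.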

10
First Order Approximation (2)
[Figure legend: Tc = checkpoint interval; Ts = time
to save a checkpoint; tr = rerun time of a failed
application]
11
First Order Approximation (3)

Using the PROVE toolset, we can measure both the
execution time and the checkpointing time of an
application.
Nagios can be used to determine the failure rate
of Grid resources.

12
Natural Synchronisation Points

Examples of natural synchronization points
Barriers.
Top or bottom of a main loop.
Collective operations (broadcast, gather,
scatter, etc.)
No inter-process communication takes place at
these points.
Therefore, there is no need to be concerned with
the state of the communication channels or
possible in-transit messages.
This eliminates the overhead incurred by the
synchronization process during checkpointing.

13
Natural Synchronisation Points (2)
[Figure: application execution with processes
interacting]
[Figure: coordinated checkpoint, waiting for
in-transit messages]
14
Natural Synchronisation Points (4)
[Figure: coordinated checkpoint, logging
in-transit messages]
[Figure: checkpointing at natural synchronisation
points]
15
New Checkpointing Approach

Using First Order Approximation only
Involves synchronisation of messages and
capturing in-transit messages.
Checkpointing at natural synchronisation points
only
May not be very effective because there are no
patterns in their occurrences.

16
New Checkpointing Approach (2)

Use a combination of both the Natural
Synchronisation Points and the First Order
Approximation.
Take checkpoints at natural synchronization
points which are closest to the optimal
checkpoint intervals.

17
Choosing Checkpoint Intervals
[Figure: choosing appropriate checkpointing
intervals]
18
Choosing Checkpoint Intervals (2)

Decision to select a checkpoint based on
Optimal checkpoint interval,
Natural synchronisation points and
Critical Region.
The checkpointing process is triggered by signals
sent to the coordinated process whenever
synchronization points are encountered.

19
The Checkpointing Process

When the coordinated process receives a signal,
it checks whether the signal falls within the
critical region.
If so, a checkpoint is taken and the clock is
reset.
If not, no checkpointing is performed.
If no natural synchronization point is met within
the critical region, a checkpoint has to be
forced at the end of the critical region.
In such cases, the checkpointing mechanism
performs synchronization to ensure there are no
lost or orphan messages.
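
The decision procedure above can be sketched as a small scheduler. This is a hedged illustration, not the authors' implementation: the class and method names are hypothetical, and the critical region is assumed to extend `region_width` on either side of the optimal interval, measured from the last checkpoint. The slides do not state exactly how the region is positioned.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointScheduler:
    """Decide when to checkpoint (hypothetical sketch)."""
    optimal_interval: float          # Tc from the first order approximation
    region_width: float              # assumed half-width of the critical region
    last_checkpoint: float = 0.0     # the "clock" that is reset on checkpoint
    log: list = field(default_factory=list)

    def critical_region(self):
        centre = self.last_checkpoint + self.optimal_interval
        return centre - self.region_width, centre + self.region_width

    def _take(self, when, kind):
        self.log.append((when, kind))
        self.last_checkpoint = when  # reset the clock

    def force_if_overdue(self, now):
        """Force a checkpoint if the critical region ended unused.

        A forced checkpoint is the case that needs message
        synchronisation (no lost or orphan messages).
        """
        _, end = self.critical_region()
        if now > end:
            self._take(end, "forced")
            return True
        return False

    def on_sync_point(self, now):
        """Handle the signal raised at a natural synchronisation point."""
        while self.force_if_overdue(now):
            pass  # catch up on any regions that ended without a checkpoint
        start, end = self.critical_region()
        if start <= now <= end:
            self._take(now, "natural")
            return True
        return False  # outside the critical region: no checkpoint


if __name__ == "__main__":
    # 8 and 2 minutes echo the testbed figures; the signal times are invented.
    s = CheckpointScheduler(optimal_interval=8.0, region_width=2.0)
    for t in [3.0, 7.0, 12.0, 16.5, 30.0]:
        s.on_sync_point(t)
    print(s.log)
```

In a real system the forced case would be driven by a timer rather than discovered at the next signal; the sketch folds both into `on_sync_point` to keep the simulation short.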

20
The TestBed

The MadCity traffic simulation tool was used.
It simulates traffic on a road network and shows
how individual vehicles behave on roads and at
junctions.
The MadCity traffic simulator can be parallelised
using PGRADE.

21
The Testbed (2)
[Figure: proposed checkpointing solution]
22
The Testbed (3)

Through the First Order Approximation, the
calculated optimal checkpoint interval was 8
minutes.
A critical region spanning 2 minutes from the
optimal checkpoint interval was defined.
Checkpoints were taken at Ns1, Ns2, Ns5, Fs1, Ns6
and Ns9.
Overall average time between checkpoints: 8.2
minutes.

23
Conclusion

The proposed checkpointing mechanism provides a
more efficient way to save checkpoint images.
It minimises the need to synchronise messages.
It keeps the average checkpointing interval close
to the optimal checkpointing interval defined by
the First Order Approximation.

24
Future Work

Integrate the checkpointing solution in PGRADE to
provide an efficient fault tolerant solution to
applications executed as Grid workflows.
Provide an efficient and reliable storage
mechanism.

25
Questions
