Title: A Fault Tolerant Protocol for Massively Parallel Machines
1. A Fault Tolerant Protocol for Massively Parallel Machines
- Sayantan Chakravorty
- Laxmikant Kale
- University of Illinois, Urbana-Champaign
2. Outline
- Motivation
- Background
- Design
- Protocols
- Results
- Summary
- Future Work
3. Motivation
- As machines grow in size
- MTBF decreases
- Applications have to tolerate faults
- Checkpoint/Rollback doesn't scale
- All nodes are rolled back just because 1 crashed
- Even nodes independent of the crashed node are restarted
- Restart cost is comparable to the checkpoint period
4. Requirements
- Fast and scalable checkpoints
- Fast restart
- Only the crashed processor should be restarted
- Minimize the effect on fault-free processors
- Restart cost less than the checkpoint period
- Low fault-free runtime overhead
- Transparent to the user
5. Background
- Checkpoint-based methods
- Coordinated: blocking [Tamir84], non-blocking [Chandy85]
- Co-check, Starfish, Clip: fault tolerant MPI implementations
- Uncoordinated: suffers from rollback propagation
- Communication-induced [Briatico84]: doesn't scale well
- Log-based methods
- Pessimistic: MPICH-V1 and V2, SBML [Johnson87]
- Optimistic [Strom85]: unbounded rollback, complicated recovery
- Causal logging [Elnozahy93] (Manetho): complicated causality tracking and recovery
6. Design
- Message Logging
- Sender-side message logging
- Asynchronous checkpoints
- Each processor has a buddy processor
- Stores its checkpoint in the buddy's memory
- Processor Virtualization
- Speed up restart
7. System Model
- Processors are fail-stop
- All communication is through messages
- Piecewise deterministic assumption holds
- Machine has a fault detection system
- Network doesn't guarantee delivery order
- No fully reliable nodes in the system
- Idea of processor virtualization is used
8. Processor Virtualization
[Figure: user view of data-driven objects, which the runtime maps to physical processors]
- Charm++
- Parallel C++ with data-driven objects called chares
- Runtime maps objects to physical processors
- Asynchronous method invocation
- Adaptive MPI
- Implemented on top of Charm++
- Multiple virtual processors on a physical processor
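To make the virtualization model concrete, here is a minimal Charm++-style sketch of a 1-D chare array with asynchronous method invocation. The module name hello, the chare names Main and Worker, and the element count are illustrative; compiling it requires the Charm++ translator to generate hello.decl.h and hello.def.h from the interface file shown in the comment.

```cpp
// hello.ci -- Charm++ interface file, processed by the charmc translator:
//   mainmodule hello {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg* m);
//       entry void done();
//     };
//     array [1D] Worker {
//       entry Worker();
//       entry void sayHi();
//     };
//   };

// hello.C -- the C++ side (names are hypothetical)
#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;
static const int kNumChares = 64;          // many more chares than physical processors

class Main : public CBase_Main {
  int replies = 0;
public:
  Main(CkArgMsg* m) {
    delete m;
    mainProxy = thisProxy;
    // The runtime, not the user, maps these 64 virtual processors onto
    // however many physical processors the job was launched with.
    CProxy_Worker workers = CProxy_Worker::ckNew(kNumChares);
    workers.sayHi();                       // asynchronous broadcast to all elements
  }
  void done() {
    if (++replies == kNumChares) CkExit(); // every chare has reported back
  }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage*) {}             // required so the chare stays migratable
  void sayHi() {
    CkPrintf("chare %d running on PE %d\n", thisIndex, CkMyPe());
    mainProxy.done();                      // asynchronous method invocation, no blocking
  }
};

#include "hello.def.h"
```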
9. Benefits of Virtualization
- Latency Tolerant
- Adaptive overlap of communication and computation
- Supports migration of virtual processors
10. Message Logging Protocol
- Correctness: messages should be processed in the same order before and after the crash
[Figure: Problem. Messages from chares A and B to chare C may be processed in a different order after the crash than before the crash]
11. Message Logging (contd.)
- Solution
- Fix an order the first time and always follow it
- Receiver gives each message a ticket number
- Process messages in order of ticket number
- Each message contains
- Sender ID: who sent it
- Receiver ID: to whom it was sent
- Sequence Number (SN): together with the sender and receiver IDs, identifies a message
- Ticket Number (TN): decides the order of processing
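As a concrete illustration, a minimal C++ sketch of such a message envelope; the type and field names are hypothetical, not the runtime's actual data structures.

```cpp
#include <cstdint>

// Illustrative chare identifier; the real runtime uses its own IDs.
using ChareID = std::uint32_t;

// Envelope carried by every logged message.
struct Envelope {
    ChareID       sender;    // Sender ID: who sent it
    ChareID       receiver;  // Receiver ID: to whom it was sent
    std::uint64_t sn;        // Sequence Number: with the sender and receiver IDs, identifies the message
    std::uint64_t tn;        // Ticket Number: fixes the order in which the receiver processes messages
};
```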
12. Message to Remote Chares
[Diagram: Chare P (sender) sends a ticket request <Sender, SN> to Chare Q (receiver); Q replies with <SN, TN, Receiver>; P then sends the message as <SN, TN, Message>]
- If <sender, SN> has been seen earlier, the existing TN is returned, marked as received
- Otherwise, create a new TN and store the <sender, SN, TN>
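A sketch of the receiver-side ticket logic under the assumptions above: a duplicate request for an already-seen <sender, SN> gets back the previously assigned TN, otherwise a fresh TN is created and the mapping stored. Class and method names are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <utility>

using ChareID = std::uint32_t;

// Per-receiver ticket state (illustrative, not the runtime's actual structures).
class TicketBook {
    std::uint64_t nextTN = 1;                                         // next ticket to hand out
    std::map<std::pair<ChareID, std::uint64_t>, std::uint64_t> seen;  // <sender, SN> -> TN
public:
    // Called when a ticket request <sender, SN> arrives from a sending chare.
    std::uint64_t requestTicket(ChareID sender, std::uint64_t sn) {
        auto key = std::make_pair(sender, sn);
        auto it = seen.find(key);
        if (it != seen.end())
            return it->second;           // seen earlier: hand back the same TN
        std::uint64_t tn = nextTN++;     // otherwise create a new TN
        seen.emplace(key, tn);           // and store <sender, SN, TN>
        return tn;
    }
};
```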
13. Message to Local Chare
- Multiple chares can live on one processor
- If the processor crashes, all trace of a local message is lost
- After restart it should have the same TN
- Store <sender, receiver, SN, TN> on the buddy
[Diagram: chares P and Q both live on processor R; the ticket exchange <Sender, SN> / <SN, TN, Receiver> / <SN, TN, Message> happens locally, and <sender, receiver, SN, TN> is sent to the buddy of processor R, which replies with an Ack]
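Because the determinant of a local message would otherwise vanish with the crashed processor, the sketch below saves the <sender, receiver, SN, TN> tuple on the buddy and only delivers the message after the buddy acknowledges. All names are hypothetical.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

using ChareID = std::uint32_t;

// Determinant of a local message; this is what must survive a crash of processor R.
struct LocalTicketRecord {
    ChareID       sender, receiver;
    std::uint64_t sn, tn;
};

// Illustrative buddy: keeps its partner's local-message determinants in memory.
struct BuddyStore {
    std::vector<LocalTicketRecord> records;
    void save(const LocalTicketRecord& r, const std::function<void()>& ack) {
        records.push_back(r);   // stored in the buddy's memory
        ack();                  // acknowledgement back to processor R
    }
};

// On processor R: deliver the local message only after the buddy has acked the record,
// so that after a restart the same TN can be recovered from the buddy.
void sendLocalMessage(BuddyStore& buddy, const LocalTicketRecord& rec,
                      const std::function<void()>& deliverLocally) {
    buddy.save(rec, /*ack=*/deliverLocally);
}
```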
14. Checkpoint Protocol
- A processor asynchronously decides to checkpoint
- Packs up the state of all its chares and sends it to the buddy
- Message logs are part of a chare's state
- Message logs on senders can be garbage collected
- Deciding when to checkpoint is an interesting problem
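A sketch of the asynchronous checkpoint step described above: serialize the chares (whose state includes their message logs), ship the checkpoint to the buddy, then let senders garbage-collect logs covered by the checkpoint. Types and function names are illustrative stand-ins for the runtime.

```cpp
#include <string>
#include <vector>

// Illustrative stand-ins for runtime entities.
struct Chare {
    std::vector<std::string> messageLog;      // sender-side log, part of the chare's state
    std::string serialize() const {           // stand-in for packing the chare's state: here only the log
        std::string s;
        for (const std::string& m : messageLog) s += m;
        return s;
    }
};

struct Processor {
    std::vector<Chare> chares;
    Processor*  buddy = nullptr;              // buddy that keeps our checkpoint
    std::string storedCheckpoint;             // checkpoint we keep for our own partner

    // Triggered asynchronously: no coordination with other processors is needed.
    void checkpoint() {
        std::string packed;
        for (const Chare& c : chares)
            packed += c.serialize();
        buddy->storedCheckpoint = packed;     // stored in the buddy's memory
        notifySendersToGarbageCollect();      // messages already reflected in the
                                              // checkpoint need never be replayed
    }

    // Stub: would tell the senders of already-processed messages that their
    // logged copies can be discarded.
    void notifySendersToGarbageCollect() {}
};
```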
15. Reliability
- Only one scenario when our protocol fails
- Processor X (buddy of Y) crashes and restarts
- Checkpoint of Y is lost
- Y now crashes before saving its checkpoint
- This is a result of not assuming reliable nodes for storing checkpoints
- Still increases reliability by orders of magnitude
- The probability can be minimized by having Y checkpoint after X crashes and restarts
16. Basic Restart Protocol
- After a crash, a Charm++ process is restarted on a new processor
- It gets the checkpoint and local message log from the buddy
- Chares are restored and other processors are informed of it
- Logged messages for chares on the restarted processor are resent
- The highest TN seen from a crashed chare is also sent
- Messages are reprocessed by the restarted chares
- Local messages are first checked against the restored local message log
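A sketch of the restart sequence listed above, with hypothetical names; the stubs stand in for runtime machinery (fault detection, object location, message resends from other processors).

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct LoggedMessage { std::uint64_t sn, tn; std::string payload; };

// Illustrative view of a Charm++ process restarted on a fresh processor.
struct RestartedProcess {
    std::string                checkpoint;     // fetched from the buddy
    std::vector<LoggedMessage> localLog;       // restored local-message determinants
    std::uint64_t              highestTNSeen = 0;

    void restart() {
        fetchCheckpointAndLocalLogFromBuddy(); // 1. checkpoint + local message log
        restoreChares();                       // 2. recreate chares from the checkpoint
        announceRestoredChares();              // 3. inform the other processors
        // 4. other processors resend the logged messages destined for the restored
        //    chares, together with the highest TN they have seen from a crashed chare.
        // 5. the restored chares reprocess the messages in TN order; local messages
        //    are looked up first in the restored local message log.
    }

    // Stubs standing in for the actual runtime operations.
    void fetchCheckpointAndLocalLogFromBuddy() {}
    void restoreChares() {}
    void announceRestoredChares() {}
};
```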
17. Parallel Restart
- Message logging allows fault-free processors to continue with their execution
- However, sooner or later some processors start waiting for the crashed processor
- Virtualization allows us to move work from the restarted processor to the waiting processors
- Chares are restarted in parallel
- Restart cost can be reduced
18. Present Status
- Most of Charm++ has been ported
- Support for migration has not yet been implemented in the fault tolerant protocol
- Simple AMPI programs work
- Barriers to be done
- Parallel restart not yet implemented
19. Experimental Evaluation
- NAS benchmarks could not be used
- Used a 5-point stencil computation with a 1-D decomposition
- Cluster of 8 quad 500 MHz Pentium III nodes with 500 MB of RAM per node, connected by Ethernet
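For reference, a minimal MPI sketch of the kind of benchmark described: a 2-D grid updated with a 5-point stencil, decomposed into 1-D strips of rows, each rank exchanging one ghost row with its neighbors per iteration. The grid size and iteration count here are arbitrary, and the benchmark in the talk ran on AMPI rather than plain MPI.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    const int N = 1024;                       // global grid is N x N (assumes N % np == 0)
    const int rows = N / np;                  // 1-D decomposition: each rank owns a strip of rows
    // Local strip plus one ghost row above and below.
    std::vector<double> cur((rows + 2) * N, 0.0), next((rows + 2) * N, 0.0);

    int up   = (rank == 0)      ? MPI_PROC_NULL : rank - 1;
    int down = (rank == np - 1) ? MPI_PROC_NULL : rank + 1;

    for (int iter = 0; iter < 100; ++iter) {
        // Exchange ghost rows with the neighboring strips.
        MPI_Sendrecv(&cur[1 * N], N, MPI_DOUBLE, up, 0,
                     &cur[(rows + 1) * N], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&cur[rows * N], N, MPI_DOUBLE, down, 1,
                     &cur[0], N, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // 5-point stencil update on the interior points of the local strip.
        for (int i = 1; i <= rows; ++i)
            for (int j = 1; j < N - 1; ++j)
                next[i * N + j] = 0.2 * (cur[i * N + j] + cur[(i - 1) * N + j] +
                                         cur[(i + 1) * N + j] + cur[i * N + j - 1] +
                                         cur[i * N + j + 1]);
        cur.swap(next);
    }
    MPI_Finalize();
    return 0;
}
```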
20. Overhead
Measurement of overhead for an application with a low communication-to-computation ratio
21. Measurement of overhead for an application with a high communication-to-computation ratio
22. Recovery Performance
Execution time with an increasing number of faults on 8 processors (checkpoint period 30 s)
23. Summary
- Designed a fault tolerant protocol that
- Performs fast checkpoints
- Performs fast parallel restarts
- Doesn't depend on any completely reliable node
- Supports multiple faults
- Minimizes the effect of a crash on fault-free processors
- Partial implementation of the protocol
24. Future Work
- Include support for migration in the protocol
- Parallel restart
- Extend to AMPI
- Test with NAS benchmark
- Study the tradeoffs involved in deciding the checkpoint period