Title: A Fault Tolerant Protocol for Massively Parallel Machines
1. A Fault Tolerant Protocol for Massively Parallel Machines
- Sayantan Chakravorty
- Laxmikant Kale
- University of Illinois, Urbana-Champaign
2. Outline
- Motivation
- Background
- Design
- Protocols
- Results
- Summary
- Future Work
3. Motivation
- As machines grow in size
- MTBF decreases
- Applications have to tolerate faults
- Checkpoint/Rollback doesn't scale
- All nodes are rolled back just because 1 crashed
- Even nodes independent of the crashed node are restarted
- Restart cost is comparable to the checkpoint period
4. Requirements
- Fast and scalable checkpoints
- Fast restart
- Only the crashed processor should be restarted
- Minimize the effect on fault-free processors
- Restart cost less than the checkpoint period
- Low fault-free runtime overhead
- Transparent to the user
5. Background
- Checkpoint-based methods
- Coordinated: blocking [Tamir84], non-blocking [Chandy85]
- Co-check, Starfish, Clip: fault tolerant MPI implementations
- Uncoordinated: suffers from rollback propagation
- Communication-induced [Briatico84]: doesn't scale well
- Log-based methods
- Pessimistic: MPICH-V1 and V2, SBML [Johnson87]
- Optimistic [Strom85]: unbounded rollback, complicated recovery
- Causal logging [Elnozahy93] (Manetho): complicated causality tracking and recovery
6. Design
- Message Logging
- Sender-side message logging
- Asynchronous checkpoints
- Each processor has a buddy processor
- Stores its checkpoint in the buddy's memory
- Processor Virtualization
- Speed up restart
7. System Model
- Processors are fail-stop
- All communication is through messages
- Piecewise deterministic assumption holds
- Machine has a fault detection system
- Network doesn't guarantee delivery order
- No fully reliable nodes in the system
- Idea of processor virtualization is used
8. Processor Virtualization
[Figure: user view of data-driven objects, which the runtime maps to physical processors]
- Charm++
- Parallel C++ with data-driven objects called chares
- Runtime maps objects to physical processors
- Asynchronous method invocation
- Adaptive MPI
- Implemented on top of Charm++
- Multiple virtual processors on a physical processor
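To make the virtualization model concrete, here is a minimal Charm++-style sketch of a 1-D chare array with asynchronous method invocation. The module name hello, the chare names Main and Worker, and the element count are illustrative; compiling it requires the Charm++ translator to generate hello.decl.h and hello.def.h from the interface file shown in the comment.

```cpp
// hello.ci -- Charm++ interface file, processed by the charmc translator:
//   mainmodule hello {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg* m);
//       entry void done();
//     };
//     array [1D] Worker {
//       entry Worker();
//       entry void sayHi();
//     };
//   };

// hello.C -- the C++ side (names are hypothetical)
#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;
static const int kNumChares = 64;          // many more chares than physical processors

class Main : public CBase_Main {
  int replies = 0;
public:
  Main(CkArgMsg* m) {
    delete m;
    mainProxy = thisProxy;
    // The runtime, not the user, maps these 64 virtual processors onto
    // however many physical processors the job was launched with.
    CProxy_Worker workers = CProxy_Worker::ckNew(kNumChares);
    workers.sayHi();                       // asynchronous broadcast to all elements
  }
  void done() {
    if (++replies == kNumChares) CkExit(); // every chare has reported back
  }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage*) {}             // required so the chare stays migratable
  void sayHi() {
    CkPrintf("chare %d running on PE %d\n", thisIndex, CkMyPe());
    mainProxy.done();                      // asynchronous method invocation, no blocking
  }
};

#include "hello.def.h"
```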
9. Benefits of Virtualization
- Latency Tolerant
- Adaptive overlap of communication and computation
- Supports migration of virtual processors
10. Message Logging Protocol
- Correctness: messages should be processed in the same order before and after the crash
[Figure: Problem. Messages from chares A and B to chare C may be processed in a different order after the crash than before the crash]
11. Message Logging (contd.)
- Solution
- Fix an order the first time and always follow it
- Receiver gives each message a ticket number
- Process messages in order of ticket number
- Each message contains
- Sender ID: who sent it
- Receiver ID: to whom it was sent
- Sequence Number (SN): together with the sender and receiver IDs, identifies a message
- Ticket Number (TN): decides the order of processing
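As a concrete illustration, a minimal C++ sketch of such a message envelope; the type and field names are hypothetical, not the runtime's actual data structures.

```cpp
#include <cstdint>

// Illustrative chare identifier; the real runtime uses its own IDs.
using ChareID = std::uint32_t;

// Envelope carried by every logged message.
struct Envelope {
    ChareID       sender;    // Sender ID: who sent it
    ChareID       receiver;  // Receiver ID: to whom it was sent
    std::uint64_t sn;        // Sequence Number: with the sender and receiver IDs, identifies the message
    std::uint64_t tn;        // Ticket Number: fixes the order in which the receiver processes messages
};
```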
12. Message to Remote Chares
[Diagram: Chare P (sender) sends a ticket request <Sender, SN> to Chare Q (receiver); Q replies with <SN, TN, Receiver>; P then sends the message as <SN, TN, Message>]
- If <sender, SN> has been seen earlier, the existing TN is returned, marked as received
- Otherwise, create a new TN and store the <sender, SN, TN>
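A sketch of the receiver-side ticket logic under the assumptions above: a duplicate request for an already-seen <sender, SN> gets back the previously assigned TN, otherwise a fresh TN is created and the mapping stored. Class and method names are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <utility>

using ChareID = std::uint32_t;

// Per-receiver ticket state (illustrative, not the runtime's actual structures).
class TicketBook {
    std::uint64_t nextTN = 1;                                         // next ticket to hand out
    std::map<std::pair<ChareID, std::uint64_t>, std::uint64_t> seen;  // <sender, SN> -> TN
public:
    // Called when a ticket request <sender, SN> arrives from a sending chare.
    std::uint64_t requestTicket(ChareID sender, std::uint64_t sn) {
        auto key = std::make_pair(sender, sn);
        auto it = seen.find(key);
        if (it != seen.end())
            return it->second;           // seen earlier: hand back the same TN
        std::uint64_t tn = nextTN++;     // otherwise create a new TN
        seen.emplace(key, tn);           // and store <sender, SN, TN>
        return tn;
    }
};
```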
13. Message to Local Chare
- Multiple chares can live on one processor
- If the processor crashes, all trace of a local message is lost
- After restart it should have the same TN
- Store <sender, receiver, SN, TN> on the buddy
[Diagram: chares P and Q both live on processor R; the ticket exchange <Sender, SN> / <SN, TN, Receiver> / <SN, TN, Message> happens locally, and <sender, receiver, SN, TN> is sent to the buddy of processor R, which replies with an Ack]
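Because the determinant of a local message would otherwise vanish with the crashed processor, the sketch below saves the <sender, receiver, SN, TN> tuple on the buddy and only delivers the message after the buddy acknowledges. All names are hypothetical.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

using ChareID = std::uint32_t;

// Determinant of a local message; this is what must survive a crash of processor R.
struct LocalTicketRecord {
    ChareID       sender, receiver;
    std::uint64_t sn, tn;
};

// Illustrative buddy: keeps its partner's local-message determinants in memory.
struct BuddyStore {
    std::vector<LocalTicketRecord> records;
    void save(const LocalTicketRecord& r, const std::function<void()>& ack) {
        records.push_back(r);   // stored in the buddy's memory
        ack();                  // acknowledgement back to processor R
    }
};

// On processor R: deliver the local message only after the buddy has acked the record,
// so that after a restart the same TN can be recovered from the buddy.
void sendLocalMessage(BuddyStore& buddy, const LocalTicketRecord& rec,
                      const std::function<void()>& deliverLocally) {
    buddy.save(rec, /*ack=*/deliverLocally);
}
```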
14. Checkpoint Protocol
- A processor asynchronously decides to checkpoint
- Packs up the state of all its chares and sends it to the buddy
- Message logs are part of a chare's state
- Message logs on senders can be garbage collected
- Deciding when to checkpoint is an interesting problem
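A sketch of the asynchronous checkpoint step described above: serialize the chares (whose state includes their message logs), ship the checkpoint to the buddy, then let senders garbage-collect logs covered by the checkpoint. Types and function names are illustrative stand-ins for the runtime.

```cpp
#include <string>
#include <vector>

// Illustrative stand-ins for runtime entities.
struct Chare {
    std::vector<std::string> messageLog;      // sender-side log, part of the chare's state
    std::string serialize() const {           // stand-in for packing the chare's state: here only the log
        std::string s;
        for (const std::string& m : messageLog) s += m;
        return s;
    }
};

struct Processor {
    std::vector<Chare> chares;
    Processor*  buddy = nullptr;              // buddy that keeps our checkpoint
    std::string storedCheckpoint;             // checkpoint we keep for our own partner

    // Triggered asynchronously: no coordination with other processors is needed.
    void checkpoint() {
        std::string packed;
        for (const Chare& c : chares)
            packed += c.serialize();
        buddy->storedCheckpoint = packed;     // stored in the buddy's memory
        notifySendersToGarbageCollect();      // messages already reflected in the
                                              // checkpoint need never be replayed
    }

    // Stub: would tell the senders of already-processed messages that their
    // logged copies can be discarded.
    void notifySendersToGarbageCollect() {}
};
```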
15. Reliability
- Only one scenario when our protocol fails
- Processor X (buddy of Y) crashes and restarts
- Checkpoint of Y is lost
- Y now crashes before saving its checkpoint
- This is a result of not assuming reliable nodes for storing checkpoints
- Still increases reliability by orders of magnitude
- The probability can be minimized by having Y checkpoint after X crashes and restarts
16. Basic Restart Protocol
- After a crash, a Charm++ process is restarted on a new processor
- It gets the checkpoint and local message log from the buddy
- Chares are restored and other processors are informed of it
- Logged messages for chares on the restarted processor are resent
- The highest TN seen from a crashed chare is also sent
- Messages are reprocessed by the restarted chares
- Local messages are first checked against the restored local message log
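A sketch of the restart sequence listed above, with hypothetical names; the stubs stand in for runtime machinery (fault detection, object location, message resends from other processors).

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct LoggedMessage { std::uint64_t sn, tn; std::string payload; };

// Illustrative view of a Charm++ process restarted on a fresh processor.
struct RestartedProcess {
    std::string                checkpoint;     // fetched from the buddy
    std::vector<LoggedMessage> localLog;       // restored local-message determinants
    std::uint64_t              highestTNSeen = 0;

    void restart() {
        fetchCheckpointAndLocalLogFromBuddy(); // 1. checkpoint + local message log
        restoreChares();                       // 2. recreate chares from the checkpoint
        announceRestoredChares();              // 3. inform the other processors
        // 4. other processors resend the logged messages destined for the restored
        //    chares, together with the highest TN they have seen from a crashed chare.
        // 5. the restored chares reprocess the messages in TN order; local messages
        //    are looked up first in the restored local message log.
    }

    // Stubs standing in for the actual runtime operations.
    void fetchCheckpointAndLocalLogFromBuddy() {}
    void restoreChares() {}
    void announceRestoredChares() {}
};
```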
17. Parallel Restart
- Message logging allows fault-free processors to continue with their execution
- However, sooner or later some processors start waiting for the crashed processor
- Virtualization allows us to move work from the restarted processor to the waiting processors
- Chares are restarted in parallel
- Restart cost can be reduced
18. Present Status
- Most of Charm++ has been ported
- Support for migration has not yet been implemented in the fault tolerant protocol
- Simple AMPI programs work
- Barriers to be done
- Parallel restart not yet implemented
19. Experimental Evaluation
- NAS benchmarks could not be used
- Used a 5-point stencil computation with a 1-D decomposition
- Cluster of 8 quad 500 MHz Pentium III nodes with 500 MB of RAM per node, connected by Ethernet
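For reference, a minimal MPI sketch of the kind of benchmark described: a 2-D grid updated with a 5-point stencil, decomposed into 1-D strips of rows, each rank exchanging one ghost row with its neighbors per iteration. The grid size and iteration count here are arbitrary, and the benchmark in the talk ran on AMPI rather than plain MPI.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    const int N = 1024;                       // global grid is N x N (assumes N % np == 0)
    const int rows = N / np;                  // 1-D decomposition: each rank owns a strip of rows
    // Local strip plus one ghost row above and below.
    std::vector<double> cur((rows + 2) * N, 0.0), next((rows + 2) * N, 0.0);

    int up   = (rank == 0)      ? MPI_PROC_NULL : rank - 1;
    int down = (rank == np - 1) ? MPI_PROC_NULL : rank + 1;

    for (int iter = 0; iter < 100; ++iter) {
        // Exchange ghost rows with the neighboring strips.
        MPI_Sendrecv(&cur[1 * N], N, MPI_DOUBLE, up, 0,
                     &cur[(rows + 1) * N], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&cur[rows * N], N, MPI_DOUBLE, down, 1,
                     &cur[0], N, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // 5-point stencil update on the interior points of the local strip.
        for (int i = 1; i <= rows; ++i)
            for (int j = 1; j < N - 1; ++j)
                next[i * N + j] = 0.2 * (cur[i * N + j] + cur[(i - 1) * N + j] +
                                         cur[(i + 1) * N + j] + cur[i * N + j - 1] +
                                         cur[i * N + j + 1]);
        cur.swap(next);
    }
    MPI_Finalize();
    return 0;
}
```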
20. Overhead
Measurement of overhead for an application with a low communication-to-computation ratio
21. Measurement of overhead for an application with a high communication-to-computation ratio
22. Recovery Performance
Execution time with an increasing number of faults on 8 processors (checkpoint period 30 s)
23. Summary
- Designed a fault tolerant protocol that
- Performs fast checkpoints
- Performs fast parallel restarts
- Doesn't depend on any completely reliable node
- Supports multiple faults
- Minimizes the effect of a crash on fault-free processors
- Partial implementation of the protocol
24. Future Work
- Include support for migration in the protocol
- Parallel restart
- Extend to AMPI
- Test with NAS benchmark
- Study the tradeoffs involved in deciding the checkpoint period