A Fault Tolerant Protocol for Massively Parallel Machines
(PPT transcript; slides hosted at http://charm.cs.uiuc.edu)
1
A Fault Tolerant Protocol for Massively Parallel
Machines
  • Sayantan Chakravorty
  • Laxmikant Kale
  • University of Illinois, Urbana-Champaign

2
Outline
  • Motivation
  • Background
  • Design
  • Protocols
  • Results
  • Summary
  • Future Work

3
Motivation
  • As machines grow in size, MTBF decreases
  • Applications have to tolerate faults
  • Checkpoint/Rollback doesn't scale
    • All nodes are rolled back just because one crashed
    • Even nodes independent of the crashed node are restarted
    • Restart cost is similar to the checkpoint period
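The MTBF argument can be made concrete with a back-of-the-envelope calculation (not from the slides): assuming independent, exponentially distributed node failures, the machine-wide MTBF is roughly the per-node MTBF divided by the node count. A minimal Python sketch, with a hypothetical five-year node MTBF:

```python
def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """MTBF of the whole machine, assuming independent node failures."""
    return node_mtbf_hours / num_nodes

# Hypothetical node MTBF of 5 years (~43,800 hours):
print(system_mtbf_hours(43_800, 1))     # one node: 43800.0 hours
print(system_mtbf_hours(43_800, 4096))  # 4096 nodes: ~10.7 hours
```

At a few thousand nodes the machine fails every few hours, which is why a checkpoint/restart scheme that pays a full-machine restart per fault stops being viable.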

4
Requirements
  • Fast and scalable checkpoints
  • Fast restart
    • Only the crashed processor is restarted
    • Minimize the effect on fault-free processors
    • Restart cost less than the checkpoint period
  • Low fault-free runtime overhead
  • Transparent to the user

5
Background
  • Checkpoint-based methods
    • Coordinated: blocking [Tamir84], non-blocking [Chandy85]
    • Co-Check, Starfish, Clip: fault-tolerant MPI
    • Uncoordinated: suffers from rollback propagation
    • Communication-induced [Briatico84]: doesn't scale well
  • Log-based methods
    • Pessimistic: MPICH-V1 and V2, SBML [Johnson87]
    • Optimistic [Strom85]: unbounded rollback, complicated recovery
    • Causal logging [Elnozahy93] (Manetho): complicated causality tracking and recovery

6
Design
  • Message logging
    • Sender-side message logging
    • Asynchronous checkpoints
  • Each processor has a buddy processor
    • Stores its checkpoint in the buddy's memory
  • Processor virtualization
    • Speeds up restart

7
System Model
  • Processors are fail-stop
  • All communication is through messages
  • Piecewise deterministic assumption holds
  • Machine has a fault detection system
  • Network doesn't guarantee delivery order
  • No fully reliable nodes in the system
  • Idea of processor virtualization is used

8
Processor Virtualization
(Figure: user view of virtualization, with many virtual processors mapped onto physical processors)
  • Charm++
    • Parallel C++ with data-driven objects (chares)
    • Runtime maps objects to physical processors
    • Asynchronous method invocation
  • Adaptive MPI (AMPI)
    • Implemented on Charm++
    • Multiple virtual processors on a physical processor

9
Benefits of Virtualization
  • Latency Tolerant
  • Adaptive overlap of communication and computation
  • Supports migration of virtual processors

10
Message Logging Protocol
  • Correctness: messages must be processed in the same order before and after a crash

(Figure: processors A and B send messages to C; without logging, C may process them in a different order after the crash than before it)
11
Message Logging (contd.)
  • Solution
    • Fix an order the first time and always follow it
    • The receiver gives each message a ticket number
    • Process messages in order of ticket number
  • Each message contains
    • Sender ID: who sent it
    • Receiver ID: to whom it was sent
    • Sequence Number (SN): together with the sender and receiver IDs, identifies a message
    • Ticket Number (TN): decides the order of processing

12
Message to Remote Chares

(Figure: chare P, the sender, sends <Sender, SN> to chare Q, the receiver; Q replies with <SN, TN, Receiver>; P then sends <SN, TN, Message>)
  • If <sender, SN> has been seen earlier, the stored TN is returned
  • Otherwise, create a new TN and store <sender, SN, TN>
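The receiver-side ticketing rule can be sketched as follows. This is an illustrative Python model, not the authors' Charm++ code; the class and method names are invented:

```python
class Receiver:
    """Sketch of receiver-side ticket numbering.

    A monotonically increasing ticket number (TN) is assigned to each new
    (sender, SN) pair; duplicate requests (e.g. resends after a crash) get
    the stored TN back, so the processing order is fixed once and reused.
    """

    def __init__(self):
        self.next_tn = 0
        self.tickets = {}  # (sender, SN) -> TN

    def request_ticket(self, sender, sn):
        key = (sender, sn)
        if key in self.tickets:      # seen earlier: return the stored TN
            return self.tickets[key]
        tn = self.next_tn            # otherwise create a new TN
        self.next_tn += 1
        self.tickets[key] = tn
        return tn

r = Receiver()
assert r.request_ticket("P", 0) == 0
assert r.request_ticket("P", 1) == 1
assert r.request_ticket("P", 0) == 0  # replayed request: same TN as before
```

The key property is idempotence: however many times a message is resent, its (sender, SN) pair maps to one TN, so replay after a crash reproduces the original processing order.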

13
Message to Local Chare
  • Multiple chares on one processor
  • If the processor crashes, all trace of a local message is lost
  • After restart the message should get the same TN
  • Store <sender, receiver, SN, TN> on the buddy

(Figure: on processor R, chare P sends <Sender, SN> to chare Q and gets back <SN, TN, Receiver> before sending <SN, TN, Message>; R saves <sender, receiver, SN, TN> on the buddy of processor R, which replies with an Ack)
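A minimal sketch of the buddy-side bookkeeping for local messages, with invented names and message shapes (the real bookkeeping lives inside the Charm++ runtime): the metadata tuple survives the crash on the buddy, so a replayed local message can be given the same TN.

```python
class BuddyStore:
    """Holds <sender, receiver, SN, TN> tuples for another processor's
    local messages, so they survive that processor's crash."""

    def __init__(self):
        self.local_msg_log = []

    def save(self, sender, receiver, sn, tn):
        self.local_msg_log.append((sender, receiver, sn, tn))
        return "Ack"  # sender processor delivers the message only after this Ack

buddy = BuddyStore()
assert buddy.save("P", "Q", 0, 7) == "Ack"
assert buddy.local_msg_log == [("P", "Q", 0, 7)]
```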
14
Checkpoint Protocol
  • A processor asynchronously decides to checkpoint
  • It packs up the state of all its chares and sends it to the buddy
  • Message logs are part of a chare's state
  • Message logs on senders can then be garbage collected
  • Deciding when to checkpoint is an interesting problem
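The garbage-collection step might look like this in outline. The dictionary message shape and the rule "drop logged messages whose TN the receiver's checkpoint already covers" are assumptions for illustration, not the paper's exact condition:

```python
def garbage_collect(log, receiver, checkpointed_tn):
    """Drop sender-side log entries addressed to `receiver` whose TN is
    at or below the highest TN included in the receiver's checkpoint;
    those messages can never be requested again on a restart."""
    return [
        m for m in log
        if not (m["receiver"] == receiver and m["tn"] <= checkpointed_tn)
    ]

log = [
    {"receiver": "Q", "tn": 1, "payload": "a"},
    {"receiver": "Q", "tn": 5, "payload": "b"},
    {"receiver": "R", "tn": 0, "payload": "c"},
]
# Q checkpointed through TN 3: only the TN-5 entry for Q must be kept.
print(garbage_collect(log, "Q", 3))
```

Entries for other receivers are untouched, which matches the asynchronous design: each processor checkpoints, and triggers collection, on its own schedule.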

15
Reliability
  • Only one scenario in which our protocol fails:
    • Processor X (the buddy of Y) crashes and restarts, so the checkpoint of Y is lost
    • Y then crashes before saving a new checkpoint
  • This is the price of not assuming reliable nodes for storing checkpoints
  • The protocol still increases reliability by orders of magnitude
  • The probability can be minimized by having Y checkpoint as soon as X crashes and restarts

16
Basic Restart Protocol
  • After a crash, a Charm++ process is restarted on a new processor
  • It gets its checkpoint and local message log from the buddy
  • Chares are restored and the other processors are informed
  • Logged messages for chares on the restarted processor are resent, along with the highest TN seen from each crashed chare
  • Messages are reprocessed by the restarted chares; local messages first check the restored local message log
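The reprocessing step above can be sketched as replaying resent messages in TN order while dropping duplicates; the message shape and function name are hypothetical:

```python
def restart_replay(resent_msgs):
    """Reprocess resent messages in ticket-number order after a restart.

    Duplicates (same sender and SN, e.g. the same message resent by several
    logs) are delivered only once, so the restarted chare sees exactly the
    pre-crash processing order."""
    seen = set()
    delivered = []
    for m in sorted(resent_msgs, key=lambda m: m["tn"]):
        key = (m["sender"], m["sn"])
        if key in seen:
            continue
        seen.add(key)
        delivered.append(m["payload"])
    return delivered

msgs = [
    {"sender": "A", "sn": 1, "tn": 2, "payload": "y"},
    {"sender": "A", "sn": 0, "tn": 1, "payload": "x"},
    {"sender": "A", "sn": 0, "tn": 1, "payload": "x"},  # duplicate resend
]
print(restart_replay(msgs))  # ['x', 'y']
```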
17
Parallel Restart
  • Message Logging allows fault-free processors to
    continue with their execution
  • However, sooner or later some processors start
    waiting for crashed processor
  • Virtualization allows us to move work from the
    restarted processor to waiting processors
  • Chares are restarted in parallel
  • Restart cost can be reduced

18
Present Status
  • Most of Charm++ has been ported
  • Support for migration has not yet been
    implemented in the fault tolerant protocol
  • Simple AMPI programs work
  • Barriers to be done
  • Parallel restart not yet implemented

19
Experimental Evaluation
  • The NAS benchmarks could not be used
  • Used a 5-point stencil computation with a 1-D decomposition
  • Cluster of 8 quad-processor 500 MHz PIII nodes with 500 MB of RAM per node, connected by Ethernet

20
Overhead
(Figure: measurement of overhead for an application with a low communication-to-computation ratio)
21
(Figure: measurement of overhead for an application with a high communication-to-computation ratio)
22
Recovery Performance
(Figure: execution time with an increasing number of faults on 8 processors, checkpoint period 30 s)
23
Summary
  • Designed a fault-tolerant protocol that
    • Performs fast checkpoints
    • Performs fast parallel restarts
    • Doesn't depend on any completely reliable node
    • Supports multiple faults
    • Minimizes the effect of a crash on fault-free processors
  • Partial implementation of the protocol

24
Future Work
  • Include support for migration in the protocol
  • Parallel restart
  • Extend to AMPI
  • Test with NAS benchmark
  • Study the tradeoffs involved in deciding the
    checkpoint period