Fault Tolerant Parallel Programming

1
Fault Tolerant Parallel Programming
  • Karl Orbell

kao98_at_doc.ic.ac.uk
2
Parallel Programming
  • Massively Parallel Processor (MPP) machines
      • Expensive
      • Supercomputers
  • GRIDs
      • Internet
      • Heterogeneous
      • Desktop machines
      • Relatively cheap

3
The Problem
  • The Internet is unstable
  • Failures
      • network failures
      • machine failures
      • failure more likely in larger systems
  • Parallel programs
      • long running
      • extensive lost work
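
The point that failure is more likely in larger systems can be made concrete with a small sketch. It assumes (the slides do not say this) that each of n machines fails independently with the same probability p during a run, so the chance that at least one fails is 1 - (1 - p)^n. The function name is illustrative only.

```c
#include <assert.h>

/* Probability that at least one of n nodes fails during a run,
 * ASSUMING independent failures with per-node probability p.
 * Hypothetical helper for illustration; not part of FTMPI. */
double any_node_fails(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 0);
    double all_survive = 1.0;
    for (int i = 0; i < n; i++)
        all_survive *= 1.0 - p;   /* every node has to survive */
    return 1.0 - all_survive;
}
```

Under this assumption, p = 0.01 gives roughly a 10% chance of at least one failure on ten machines and roughly 63% on a hundred.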

4
The Solution
  • State Saving
  • What needs saving?
      • internal: process space, data
      • external: I/O streams, network, physical
  • Problems
      • General: unrestorable state
      • Parallel: heterogeneity, permanent failure
      • Synchronize versus Copy-on-Write
      • Parallel checkpoint file availability

5
FTMPI - Fault Tolerant MPI
  • Custom MPI implementation
      • Existing implementations too complex to modify
  • Limitations
      • basic command set
      • no communicators
  • Objective: MPI parallel checkpointing
  • ANSI C API: same style as the MPI interface
  • Library style: compile-time linking

6
FTMPI - Fault Detection
  • Network Faults
      • Single or multiple node failure
  • TCP/IP Fault Handling
      • not enough: nothing is reported if a machine crashes
  • Heartbeats
      • Regular keep-alive messages; if the
        heartbeats stop, then the node has died
      • Timeouts: maximum time between heartbeats
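
The heartbeat-and-timeout rule above can be sketched as a small bookkeeping table. This is a minimal illustration, not FTMPI's actual code: the type and function names are hypothetical, and times are passed in as plain seconds so the logic runs without a real clock or network.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 64

/* Hypothetical bookkeeping for heartbeat-based failure detection. */
typedef struct {
    int    node_count;
    double timeout;                /* maximum time between heartbeats */
    double last_seen[MAX_NODES];   /* arrival time of the latest heartbeat */
} heartbeat_table;

void hb_init(heartbeat_table *t, int node_count, double timeout, double now)
{
    assert(node_count > 0 && node_count <= MAX_NODES && timeout > 0.0);
    t->node_count = node_count;
    t->timeout = timeout;
    for (int i = 0; i < node_count; i++)
        t->last_seen[i] = now;     /* grace period starts at initialisation */
}

/* Record a keep-alive message from `node` that arrived at time `now`. */
void hb_record(heartbeat_table *t, int node, double now)
{
    assert(node >= 0 && node < t->node_count);
    t->last_seen[node] = now;
}

/* A node is presumed dead once its heartbeats have been silent
 * for longer than the timeout. */
bool hb_is_dead(const heartbeat_table *t, int node, double now)
{
    assert(node >= 0 && node < t->node_count);
    return now - t->last_seen[node] > t->timeout;
}

/* Count presumed-dead nodes; covers single or multiple node failure. */
int hb_count_dead(const heartbeat_table *t, double now)
{
    int dead = 0;
    for (int i = 0; i < t->node_count; i++)
        if (hb_is_dead(t, i, now))
            dead++;
    return dead;
}
```

The timeout trades detection speed against false alarms: a short timeout notices a crash sooner but can misreport a slow node as dead.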

7
FTMPI - Implementation
  • Threaded Servers for Communications
      • pthreads
      • Message Reception Server
      • Heartbeat Send/Receive Servers
  • Alternatives?
      • Signals
      • Interrupts
      • Alarms
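
As a rough illustration of the threaded-server structure, the sketch below runs a heartbeat send server and a heartbeat receive server as two pthreads. It is not FTMPI's implementation: the network is replaced by an in-process counter guarded by a mutex and condition variable, and every name in it is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical in-process stand-in for the link between the two servers. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    int pending;    /* heartbeats sent but not yet received */
    int to_send;    /* number of heartbeats the sender emits */
    int received;   /* total seen by the receiver */
} hb_link;

static void *hb_send_server(void *arg)
{
    hb_link *l = arg;
    for (int i = 0; i < l->to_send; i++) {
        pthread_mutex_lock(&l->lock);
        l->pending++;                        /* "send" one keep-alive */
        pthread_cond_signal(&l->changed);
        pthread_mutex_unlock(&l->lock);
    }
    return NULL;
}

static void *hb_receive_server(void *arg)
{
    hb_link *l = arg;
    pthread_mutex_lock(&l->lock);
    while (l->received < l->to_send) {
        while (l->pending == 0)              /* sleep until a heartbeat arrives */
            pthread_cond_wait(&l->changed, &l->lock);
        l->received += l->pending;
        l->pending = 0;
    }
    pthread_mutex_unlock(&l->lock);
    return NULL;
}

/* Start both servers, wait for them to finish, and return how many
 * heartbeats the receiver saw. */
int run_heartbeat_servers(int beats)
{
    assert(beats >= 0);
    hb_link l;
    pthread_mutex_init(&l.lock, NULL);
    pthread_cond_init(&l.changed, NULL);
    l.pending = 0;
    l.to_send = beats;
    l.received = 0;

    pthread_t sender, receiver;
    int rc = pthread_create(&receiver, NULL, hb_receive_server, &l);
    assert(rc == 0);
    rc = pthread_create(&sender, NULL, hb_send_server, &l);
    assert(rc == 0);
    (void)rc;
    pthread_join(sender, NULL);
    pthread_join(receiver, NULL);

    int received = l.received;
    pthread_cond_destroy(&l.changed);
    pthread_mutex_destroy(&l.lock);
    return received;
}
```

Dedicated threads let communication proceed while the application computes, which is the main contrast with the signal, interrupt, and alarm alternatives listed above.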

8
FTMPI - Interfaces
9
Checkpointing - Synchronization
  • Synchronization: all must be stationary
  • External
      • All nodes must checkpoint together
      • MPI_Barrier() is used to synchronize
  • Internal
      • Threads must be paused
      • Yield points at regular intervals, to pause if
        necessary
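
The internal half of this synchronization, pausing threads at yield points, might look like the following sketch. It only assumes what the slide states (threads check in at regular intervals and pause if necessary); the gate type, the function names, and the demo are hypothetical, and the external MPI_Barrier() step is not shown.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pause gate shared by a checkpoint coordinator and workers. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    bool pause_requested;   /* set by the coordinator */
    int  parked;            /* workers currently stopped at a yield point */
} pause_gate;

void gate_init(pause_gate *g)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->changed, NULL);
    g->pause_requested = false;
    g->parked = 0;
}

/* Called by a worker at regular intervals. Returns at once unless a
 * pause has been requested, in which case it blocks until resumed. */
void yield_point(pause_gate *g)
{
    pthread_mutex_lock(&g->lock);
    if (g->pause_requested) {
        g->parked++;
        pthread_cond_broadcast(&g->changed);   /* report "stationary" */
        while (g->pause_requested)
            pthread_cond_wait(&g->changed, &g->lock);
        g->parked--;
    }
    pthread_mutex_unlock(&g->lock);
}

/* Coordinator: request a pause and wait until `workers` threads are parked. */
void gate_pause_all(pause_gate *g, int workers)
{
    pthread_mutex_lock(&g->lock);
    g->pause_requested = true;
    while (g->parked < workers)
        pthread_cond_wait(&g->changed, &g->lock);
    pthread_mutex_unlock(&g->lock);
}

void gate_resume_all(pause_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->pause_requested = false;
    pthread_cond_broadcast(&g->changed);
    pthread_mutex_unlock(&g->lock);
}

typedef struct {
    pause_gate *gate;
    atomic_bool stop;
    long        steps;   /* read by others only while the worker is parked */
} worker;

static void *worker_main(void *arg)
{
    worker *w = arg;
    while (!atomic_load(&w->stop)) {
        w->steps++;              /* one unit of work */
        yield_point(w->gate);    /* pause here if a checkpoint is pending */
    }
    return NULL;
}

/* Demo: pause one worker, snapshot its progress, resume, then stop it.
 * Returns true if the snapshot was taken after at least one step and
 * the worker never went backwards. */
bool pause_demo(void)
{
    pause_gate g;
    gate_init(&g);
    worker w = { .gate = &g, .steps = 0 };
    atomic_init(&w.stop, false);
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, worker_main, &w);
    assert(rc == 0);
    (void)rc;

    gate_pause_all(&g, 1);
    long snapshot = w.steps;     /* safe: the worker is stationary */
    gate_resume_all(&g);

    atomic_store(&w.stop, true);
    pthread_join(tid, NULL);
    return snapshot >= 1 && w.steps >= snapshot;
}
```

The spacing of yield points sets how long a checkpoint request can wait: a worker that rarely reaches one delays every other thread that is already parked.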

10
Checkpointing - Local Checkpoint
  • Stack only, Register Set
  • User Data
  • Saved to working directory, which can be a file
    server
  • FTMPI does not handle
      • Entire heap, because of thread library problems
      • I/O streams, because of complexities across
        parallel systems
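
Saving user data to a file in the working directory could be sketched as below. The slides do not describe FTMPI's file format, so the header layout, the magic number, and the function names are all assumptions. Writing to a temporary file and renaming it relies on POSIX rename() replacing the target, which keeps the previous checkpoint intact if a crash interrupts the write.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical on-disk header; not FTMPI's real format. */
typedef struct {
    unsigned long magic;
    unsigned long size;    /* number of user-data bytes that follow */
} ckpt_header;

#define CKPT_MAGIC 0x46544D50UL   /* ASCII "FTMP", chosen for this sketch */

/* Save `size` bytes of user data to `path`. Returns 0 on success, -1 on error. */
int ckpt_save(const char *path, const void *data, size_t size)
{
    assert(path != NULL && (data != NULL || size == 0));
    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    ckpt_header h = { CKPT_MAGIC, (unsigned long)size };
    int ok = fwrite(&h, sizeof h, 1, f) == 1 &&
             (size == 0 || fwrite(data, size, 1, f) == 1);
    if (fclose(f) != 0)
        ok = 0;
    if (!ok || rename(tmp, path) != 0) {   /* publish only a complete file */
        remove(tmp);
        return -1;
    }
    return 0;
}

/* Restore exactly `size` bytes from `path`. Returns 0 on success, -1 if
 * the file is missing, has the wrong magic, or holds a different size. */
int ckpt_load(const char *path, void *data, size_t size)
{
    assert(path != NULL && (data != NULL || size == 0));
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    ckpt_header h;
    int ok = fread(&h, sizeof h, 1, f) == 1 &&
             h.magic == CKPT_MAGIC && h.size == size &&
             (size == 0 || fread(data, size, 1, f) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Because the data is written as raw bytes, a file produced this way is only readable on a machine with the same data layout, which is one face of the heterogeneity problem raised earlier in the talk.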

11
User-Directed Checkpointing
  • Explicitly placed checkpointing points
      • at end of iterations
      • and other low-data places
  • Save only requested data

while (keep_going) {   /* slide wrote "continue", a reserved word in C */
    /* iteration work */
    /* store data */
    if (FTMPI_Perform_Checkpoint() == FTMPI_RESTORED) {
        /* restore data */
    }
}
12
Checkpoint Procedure
13
Performance - MPI
14
Performance - Checkpointing
15
Demonstration - Pi
  • Monte Carlo Approximation of Pi
  • 4 machines
  • Checkpointing at regular intervals
  • Restoration from last checkpoint
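
A single-process sketch of the idea behind this demonstration follows. It is not the code that was shown: the four-machine MPI setup is left out, the checkpoint is an in-memory copy of a small state struct, and the random number generator is a simple linear congruential generator chosen so that runs are reproducible. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Everything needed to resume the estimate: RNG state plus tallies. */
typedef struct {
    uint64_t rng;       /* state of a small LCG (illustrative quality only) */
    uint64_t samples;   /* points drawn so far */
    uint64_t hits;      /* points that fell inside the quarter circle */
} pi_state;

/* Next pseudo-random value in [0, 1). 64-bit LCG with Knuth's MMIX
 * constants; the top 53 bits are used as the fraction. */
static double next_unit(pi_state *s)
{
    s->rng = s->rng * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(s->rng >> 11) / 9007199254740992.0;   /* divide by 2^53 */
}

void pi_init(pi_state *s, uint64_t seed)
{
    s->rng = seed;
    s->samples = 0;
    s->hits = 0;
}

/* Draw n more points in the unit square and tally those with x^2 + y^2 <= 1. */
void pi_run(pi_state *s, uint64_t n)
{
    for (uint64_t i = 0; i < n; i++) {
        double x = next_unit(s);
        double y = next_unit(s);
        if (x * x + y * y <= 1.0)
            s->hits++;
        s->samples++;
    }
}

/* The quarter circle covers pi/4 of the unit square. */
double pi_estimate(const pi_state *s)
{
    assert(s->samples > 0);
    return 4.0 * (double)s->hits / (double)s->samples;
}
```

Copying the struct (`pi_state saved = state;`) acts as a checkpoint and assigning it back acts as a restore. Since the generator state is part of the struct, a run that is restored and continued reaches the same tallies as an uninterrupted run with the same seed and sample count.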

16
Q & A