Title: Application-Level Fault Tolerance for MPI Programs
1. Application-Level Fault Tolerance for MPI Programs
- Keshav Pingali
- http://iss.cs.cornell.edu
- Joint work with Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Paul Stodghill
2. The Problem
- Old picture of high-performance computing
- Turn-key big-iron platforms
- Short-running codes
- Modern high-performance computing
- Less Reliable Platforms
- Extremely large systems (millions of parts)
- Large clusters of commodity parts
- Grid Computing
- Long-running codes
- Program runtimes greatly exceed mean time to failure
- ASCI, Blue Gene, PSC, Illinois Rocket Center
- ⇒ Fault tolerance is critical
3. Fault tolerance
- Fault tolerance comes in different flavors
- Mission-critical systems (e.g., air traffic control): no down-time, fail-over, redundancy
- Computational applications
- Restart after failure, minimizing lost work
- Guarantee progress
4. Fault Models
- Fail-stop
- Failed process dies silently
- Does not send corrupted messages
- Does not corrupt shared data
- Byzantine
- Arbitrary misbehavior is allowed
- Our focus
- Fail-stop faults
5. Fault tolerance strategies
- Our experience
- Scientific programs communicate much more frequently than typical distributed systems do.
- Message logging is therefore not practical for scientific programs.
6. Checkpoint/restart (CPR)
- System-level checkpointing (SLC), e.g., Condor
- core-dump style snapshots of computations
- very architecture and OS dependent
- checkpoints are not portable
- Application-level checkpointing (ALC)
- program saves and restores its own state
- e.g., n-body codes save and restore positions and velocities of particles
- programs are self-checkpointing and self-restarting
- amount of state saved can be much smaller than with SLC
- IBM's Blue Gene protein folding
- Megabytes vs. terabytes
- Alegra (Sandia)
- App-level restart file only 5% of core size
- Disadvantages of current application-level checkpointing
- manual implementation
- requires global barriers in programs
7. Our Approach
- Automate application-level check-pointing
- Minimize programmer annotations
- Generalize to arbitrary MPI programs w/o barriers
[Architecture diagram: the precompiler transforms the original application into application + state-saving code, which runs over a thin coordination layer (with failure detector) on top of the MPI implementation and a reliable communication layer]
8. Outline
- Saving single-process state
- Stack, heap, globals, locals, ...
- Coordination of single-process states into a global snapshot
- Basic issues
- Crossing messages, non-determinism, ...
- Our protocol for point-to-point messages
- Collective communication
- Implementation results
- Overheads are minimal
9. System Architecture: Single-Processor Checkpointing
[Architecture diagram (as on slide 7)]
10. Precompiler
- Where to checkpoint
- At calls to the potentialCheckpoint() function
- Mandatory calls in main process (initiator)
- Other calls are optional
- Process checks whether a global checkpoint has been initiated and, if so, joins the protocol to save its state
- Inserted by programmer or automated tool
- Currently inserted by programmer
- Transformed program can save its state at calls to potentialCheckpoint()
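As a sketch (with illustrative names, not the actual C3 interface), the inserted potentialCheckpoint() call can be a cheap flag test that joins the checkpoint protocol only when the initiator has started one:

```c
#include <stdbool.h>

/* Hypothetical runtime state: set asynchronously when the initiator
 * starts a global checkpoint. Names are illustrative. */
static bool checkpoint_initiated = false;
static int  checkpoints_taken   = 0;

static void join_checkpoint_protocol(void) {
    checkpoints_taken++;           /* stand-in for saving heap/stack/globals */
    checkpoint_initiated = false;
}

/* The precompiler turns each annotation into this cheap test. */
void potentialCheckpoint(void) {
    if (checkpoint_initiated)
        join_checkpoint_protocol();
}

/* Demo driver: 100 timesteps, a checkpoint is initiated mid-run. */
int run_demo(void) {
    checkpoints_taken = 0;
    for (int step = 0; step < 100; step++) {
        /* ... one timestep of computation ... */
        if (step == 42) checkpoint_initiated = true;  /* simulate initiator */
        potentialCheckpoint();   /* cheap when no checkpoint is pending */
    }
    return checkpoints_taken;
}
```

The point of the design is that uninvolved calls cost only one branch, so many optional spots can be sprinkled through the code.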
11. Saving Application State
- Must save
- Heap: we provide a special malloc that tracks the memory it allocates
- Globals: the precompiler knows the globals and inserts statements to explicitly save them
- Call stack, locals, and program counter: maintain a separate stack which records all functions that got called and the local variables inside them
- Similar to work done with PORCH (MIT)
- PORCH is portable but not transparent to the programmer
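A minimal sketch of the heap-tracking idea, using a hypothetical ckpt_malloc wrapper (illustrative only; among other things this sketch omits free() handling):

```c
#include <stdlib.h>

/* Each allocation is prefixed with a header and linked into a list,
 * so a checkpoint can walk all live blocks. */
typedef struct block {
    struct block *next;
    size_t        size;
    /* payload follows */
} block_t;

static block_t *heap_list = NULL;

void *ckpt_malloc(size_t size) {
    block_t *b = malloc(sizeof(block_t) + size);
    if (!b) return NULL;
    b->size = size;
    b->next = heap_list;
    heap_list = b;
    return b + 1;                 /* hand the payload to the caller */
}

/* At checkpoint time, walk the list to find what must be saved. */
size_t ckpt_heap_bytes(void) {
    size_t total = 0;
    for (block_t *b = heap_list; b; b = b->next)
        total += b->size;
    return total;
}
```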
12. Example
(LS is the stack of labels recording the call path; VDS is the stack of local-variable descriptors from slide 11.)

    main() {
      int a;
      VDS.push(&a, sizeof a);
      if (restart) {
        load LS;
        copy LS to LS.old;
        jump dequeue(LS.old);
      }
      // ...
      LS.push(2);
      label2:
      function();
      LS.pop();
      // ...
      VDS.pop();
    }

    function() {
      int b;
      VDS.push(&b, sizeof b);
      if (restart)
        jump dequeue(LS.old);
      // ...
      LS.push(2);
      take_ckpt();
      label2:
      if (restart) {
        load VDS;
        restore variables;
      }
      LS.pop();
      // ...
      VDS.pop();
    }
13. Reducing saved state
- Statically determine spots in the code with the least amount of state
- Determine live data at the time of a checkpoint
- Incremental state-saving
- Recomputation vs. saving state
- e.g., protein folding: A → B → C
- Prior work
- CATCH (Illinois) uses runtime learning rather than static analysis
- Beck, Plank, and Kingsley (UTK): memory exclusion analysis of static data
14. Outline
- Saving single-process state
- Stack, heap, globals, locals, ...
- Coordination of single-process states into a global snapshot
- Basic issues
- Crossing messages, non-determinism, ...
- Our Protocol
- Collective communication
- Implementation results
- Overheads are minimal
15. System Architecture: Distributed Checkpointing
[Architecture diagram (as on slide 7)]
16. Need for Coordination
[Timeline diagram: processes P and Q, their checkpoints, the recovery line, and past, future, early, and late messages]
- Horizontal lines: events in each process
- Recovery line
- line connecting checkpoints on each processor
- represents global system state on recovery
- Problem with communication
- messages may cross the recovery line
17. Late Messages
[Diagram: a late message, sent before the sender's checkpoint and received after the receiver's]
- Must record message data at the receiver as part of the checkpoint
- On recovery, re-read the recorded message data
18. Early Messages
[Diagram: an early message, sent after the sender's checkpoint and received before the receiver's]
- Must suppress the resending of the message on recovery
19. Early Messages
[Diagram: an early message, sent after the sender's checkpoint and received before the receiver's]
- Must suppress the resending of the message on recovery
- What about non-deterministic events before the send?
- Must ensure the application generates the same early message on recovery
- Record and replay all non-deterministic events between the checkpoint and the send
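The record-and-replay idea for one kind of non-deterministic event (a random draw) can be sketched as follows; the names are illustrative, not the actual protocol API:

```c
#include <stdlib.h>

/* During the recording phase, every non-deterministic value is logged;
 * on recovery, the log is replayed so the same early messages are
 * regenerated. */
#define LOG_MAX 1024
static int event_log[LOG_MAX];
static int log_len = 0;    /* events recorded since the checkpoint */
static int log_pos = 0;    /* replay cursor on recovery */
static int replaying = 0;

/* Switch to replay mode, as recovery would after reloading the log. */
void start_replay(void) { replaying = 1; log_pos = 0; }

int logged_rand(void) {
    if (replaying && log_pos < log_len)
        return event_log[log_pos++];   /* replay the recorded value */
    int v = rand();
    if (log_len < LOG_MAX)
        event_log[log_len++] = v;      /* record during normal execution */
    return v;
}
```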
20. Difficulty of Coordination
[Diagram: P's and Q's checkpoints, with no messages in flight]
- No communication ⇒ no coordination necessary
21. Difficulty of Coordination
[Diagram: only past and future messages; none cross the recovery line]
- No communication ⇒ no coordination necessary
- BSP-style programs ⇒ checkpoint at barrier
22. Difficulty of Coordination
[Diagram: a late message crosses the recovery line]
- No communication ⇒ no coordination necessary
- BSP-style programs ⇒ checkpoint at barrier
- General MIMD programs
23. Difficulty of Coordination
[Diagram: a late message crosses the recovery line]
- No communication ⇒ no coordination necessary
- BSP-style programs ⇒ checkpoint at barrier
- General MIMD programs
- System-level checkpointing (e.g., Chandy-Lamport)
- Forces checkpoints to avoid early messages
- Assumed by existing work
24. Difficulty of Coordination
[Diagram: both late and early messages cross the recovery line]
- No communication ⇒ no coordination necessary
- BSP-style programs ⇒ checkpoint at barrier
- General MIMD programs
- System-level checkpointing (e.g., Chandy-Lamport)
- Only late messages
- Application-level checkpointing
- Checkpoint locations are fixed; extra checkpoints cannot be forced
- Late and early messages
- Requires a new protocol
25. MPI-specific issues
- Non-FIFO communication tags
- Non-blocking communication
- Collective communication
- MPI_Reduce(), MPI_AllGather(), MPI_Bcast()
- Internal MPI library state
- Visible
- non-blocking request objects, datatypes, communicators, attributes
- Invisible
- internal timers, buffers, IP address mappings, etc.
26. Outline
- Saving single-process state
- Stack, heap, globals, locals, ...
- Coordination of single-process states into a global snapshot
- Basic issues
- Crossing messages, non-determinism, ...
- Our Protocol
- Collective communication
- Implementation results
- Overheads are minimal
27. The Global View
[Diagram: the execution of the initiator and processes P and Q divided into epochs 0 through n, separated by recovery lines]
- A program's execution is divided into a series of disjoint epochs
- Epochs are separated by recovery lines
- A failure in epoch n means all processes roll back to the prior recovery line
28. Protocol Outline (I)
[Diagram: the initiator checkpoints and sends pleaseCheckpoint to P and Q]
- Initiator checkpoints, then sends a pleaseCheckpoint message to all others
- After receiving this message, each process checkpoints at the next available spot
29. Protocol Outline (II)
[Diagram: after checkpointing, each process enters a recording phase]
- After checkpointing, each process keeps a record containing
- data of messages from the last epoch (late messages)
- non-deterministic events (that happened before early-message sends)
30. Protocol Outline (IIIa)
[Diagram: initiator and processes P and Q]
- Globally, recording can stop when
- all processes have received their late messages
- all processes have sent their early messages
31. Protocol Outline (IIIb)
[Diagram: P and Q send readyToStopRecording to the initiator]
- Locally, when a process
- has received all its late messages
- has sent all its early messages
- ⇒ it sends a readyToStopRecording message to the initiator
32. Protocol Outline (IV)
[Diagram: the initiator sends stopRecording to P and Q; an application message also arrives from a process that has already stopped recording]
- When the initiator receives readyToStopRecording from everyone, it sends stopRecording to everyone
- A process stops recording when it
- receives a stopRecording message, OR
- receives a message from a process that has stopped recording
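One process's side of the stop-recording logic can be sketched as a small state machine (the struct fields and function names are illustrative assumptions, not the actual implementation):

```c
#include <stdbool.h>

/* Per-process protocol state. A flag piggybacked on each application
 * message tells the receiver whether the sender has stopped recording. */
typedef struct {
    bool recording;
    int  late_pending;   /* late messages not yet received */
    bool early_done;     /* all early messages have been sent */
    bool ready_sent;     /* readyToStopRecording already sent */
} proc_t;

/* Called whenever local progress is made; returns true exactly when the
 * process should send readyToStopRecording to the initiator. */
bool maybe_ready(proc_t *p) {
    if (p->recording && !p->ready_sent &&
        p->late_pending == 0 && p->early_done) {
        p->ready_sent = true;
        return true;
    }
    return false;
}

/* Recording stops on a stopRecording message, or on an application
 * message whose piggybacked flag says the sender already stopped. */
void on_stop_recording(proc_t *p) { p->recording = false; }
void on_app_message(proc_t *p, bool sender_stopped) {
    if (sender_stopped) p->recording = false;
}
```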
33. Protocol Discussion
[Diagram: an application message from a process that has stopped recording arrives before the stopRecording message does]
- Why can't we just wait to receive the stopRecording message?
- Our record would depend on a non-deterministic event, invalidating it.
- The application message may be different or may not be resent on recovery.
34. Protocol Details
- Piggyback 4 bytes of control data to tell whether a message is late, early, etc.
- Can be reduced to 2 bits
- On recovery
- reinitialize the MPI library using MPI_Init()
- restore the single-process state
- recreate datatypes and communicators
- ensure that all calls to MPI_Send() and MPI_Recv() are suppressed/fed with data as necessary
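Assuming the piggybacked control data carries the sender's epoch number, a receiver could classify each message by comparing epochs (an illustrative sketch, not the actual wire format):

```c
/* A message from an older epoch is late (sender had not yet
 * checkpointed); one from a newer epoch is early (sender had already
 * checkpointed); equal epochs mean an ordinary intra-epoch message. */
typedef enum { MSG_INTRA, MSG_LATE, MSG_EARLY } msg_class_t;

msg_class_t classify(int sender_epoch, int recv_epoch) {
    if (sender_epoch < recv_epoch) return MSG_LATE;
    if (sender_epoch > recv_epoch) return MSG_EARLY;
    return MSG_INTRA;
}
```

Late messages are then logged at the receiver, and early messages trigger non-determinism recording at the sender, as in the protocol above.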
35. Outline
- Saving single-process state
- Stack, heap, globals, locals, ...
- Coordination of single-process states into a global snapshot
- Basic issues
- Crossing messages, non-determinism, ...
- Our Protocol
- Collective communication
- Implementation results
- Overheads are minimal
36. MPI Collective Communications
- Single communication involving multiple processes
- Single-receiver: multiple senders, one receiver
- e.g., Gather, Reduce
- Single-sender: one sender, multiple receivers
- e.g., Bcast, Scatter
- All-to-all: every process in the group sends data to every other process
- e.g., AlltoAll, AllGather, AllReduce, Scan
- Barrier: everybody waits for everybody else to reach the barrier before going on
- (the only collective call with an explicit synchronization guarantee)
37. Possible Solutions
- We have a protocol for point-to-point messages. Why not reimplement all collectives as point-to-point messages?
- Lots of work, and less efficient than the native implementation.
- Instead: checkpoint collectives directly, without breaking them up.
- May be complex, but requires no reimplementation of MPI internals.
38. AlltoAll Example
[Diagram: P, Q, and R each call MPI_AlltoAll(); data flows connect every pair of processes]
- Data flows represent the application-level semantics of how data travels
- They do NOT correspond to real messages
- Used to reason about the application's view of communications
- AlltoAll data flows form a bidirectional clique, since data flows in both directions
- The recovery line may be
- before the AlltoAll
- after the AlltoAll
- straddling the AlltoAll
39. AlltoAll Example
[Diagram: P, Q, and R each call MPI_AlltoAll()]
- Before the AlltoAll: no problem
- On recovery, the application will re-execute the AlltoAll
- After the AlltoAll: no problem
- The application won't care about the AlltoAll
40. Straddling AlltoAll: What to Do
[Diagram: the recovery line straddles the AlltoAll; the result is recorded]
- Straddling the AlltoAll: the only remaining case
- P↔Q and P↔R late/early data flows
- Record the result and replay it for P
- Suppress P's call to MPI_AlltoAll
- Record/replay non-determinism before P's MPI_AlltoAll call
41. Collective Communication
- Single-sender/single-collector collectives have a similar solution
- May also reissue some MPI calls
- Barrier is very different and requires a new solution
42. Barrier
[Diagram: P, Q, and R each call MPI_Barrier()]
- Recovery line before or after the barrier: no problem
43. Barrier
[Diagram: the recovery line straddles the barrier]
- Recovery line straddles the barrier: problem!
- No way for recovery to uphold the barrier's synchronization semantics
- No process may pass the barrier until every other process has reached it in real time
44. Barrier
[Diagram: each barrier is preceded by a checkpoint spot]
- Solution: ensure that barriers never straddle recovery lines
- Precede each barrier with a special checkpoint spot
- If one process took a checkpoint before the barrier, everybody else does too
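A sketch of the barrier rule (illustrative names; in real MPI the agreement on the flag would be a collective step, e.g. an allreduce of the local flags before the barrier, simulated here by a shared variable):

```c
#include <stdbool.h>

static bool any_checkpointed  = false;  /* stand-in for the agreed flag */
static int  checkpoints_taken = 0;

static void take_checkpoint(void) { checkpoints_taken++; }

/* Called in place of a bare MPI_Barrier(): if any process wants a
 * checkpoint, every process checkpoints before entering the barrier,
 * so the recovery line can never straddle it. */
void checkpointing_barrier(bool i_want_checkpoint) {
    if (i_want_checkpoint) any_checkpointed = true;
    if (any_checkpointed) take_checkpoint();
    /* ... then the real MPI_Barrier() would run ... */
}
```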
45. Outline
- Saving single-process state
- Stack, heap, globals, locals, ...
- Coordination of single-process states into a global snapshot
- Basic issues
- Crossing messages, non-determinism, ...
- Our Protocol
- Collective communication
- Implementation results
46. Implementation
- Several sequential platforms (cf. Condor)
- Linux: Dell PowerEdge 1650
- Solaris: Sun V210
- Two large-scale parallel platforms
- Lemieux: 750-node AlphaServer at Pittsburgh
- Velocity 2: 128-node dual-processor Windows cluster
- Benchmarks
- NAS suite: CG, LU, SP
- SMG2000, HPL
47. Sequential Experiments (vs. Condor)
- Checkpoint sizes are comparable.
- Ongoing work: reduce checkpoint sizes through compiler analysis
48. Runtimes on Lemieux
- Compare the original running time of the code with the running time using the C3 system without taking any checkpoints
- Chronic overhead is small (< 10%)
49. Runtimes on V2
- Overheads on the Windows cluster are also small, except for SMG2000.
- The relatively large overhead in SMG2000 might be due to initialization code.
50. Overheads w/checkpointing on Lemieux
- Configuration 1: no checkpoints
- Configuration 2: go through the motions of taking one checkpoint, but write nothing to disk
- Configuration 3: write checkpoint data to the local disk on each machine
- Measurement noise: about 2-3%
- Conclusion: the relative cost of taking a checkpoint is fairly small
51. Overhead w/checkpointing on V2
- Overheads on V2 for taking checkpoints are also fairly small.
52. Other work
- Designed a similar protocol for shared-memory programs
- Implemented the protocol and evaluated it on SPLASH-2 programs
- Overheads are small (< 10%)
- Paper in ASPLOS 2004
53. Ongoing work
- Compiler analysis to reduce the amount of saved state (with Radu Rugina)
- Identify live data
- Incremental checkpointing
- Recomputation vs. state-saving
- Portable checkpointing (Rohit Fernandes)
- Restart a checkpoint on a different machine
- Useful for task migration in grid environments
- MPI-2
- One-sided communication
54. Contributions
- Developed a system for making MPI applications fault tolerant
- Precompiler-based single-process checkpointer
- Minimal programmer annotations
- Developed and implemented a novel protocol for distributed application-level checkpointing
- Works with any single-process checkpointer
- Can transparently handle all features of MPI
- Non-FIFO, non-blocking, collective, communicators, etc.
- Portable across MPI implementations
- Components are orthogonal
- Can be used/applied independently
- Extended to shared-memory (OpenMP) programs
- Overhead is low