Compiler-Assisted Checkpointing for MPI Programs - PowerPoint PPT Presentation

1
Compiler-Assisted Checkpointing for MPI Programs
  • Alison N. Smith and Calvin Lin
  • The University of Texas at Austin
  • Department of Computer Sciences

2
Introduction
  • High Performance Computing is essential to
    science and engineering
  • Clusters are used for HPC
  • Peak performance scales linearly with the number
    of nodes
  • Reliability decreases with the number of nodes
  • 1 node fails every 10,000 hours
  • 6,000 nodes have a failure every 1.6 hours
  • 64,000 nodes have a failure every 5 minutes (?)

3
Current Solutions
  • Fault tolerance requires redundancy
  • Redundancy in space
  • Each participating process has a backup process
  • Expensive!
  • Redundancy in time
  • Processes save state and then rollback for
    recovery
  • Allows cheaper fault tolerance

4
Today's Answer
  • Programmers place checkpoints
  • Coordinated
  • Utilizes programmer expertise about the application
  • Where is the problem?
  • Future systems!

5
What is the problem?
  • Today

(Diagram: processes over time, with checkpoints marked by X)
6
Decisions, Decisions
  • Checkpoints must be placed carefully
  • Correct
  • Beneficial
  • Cluster architecture
  • Application state

7
In the future
  • Systems will be more complex
  • Programs will be more complex
  • Checkpointing will be more complex
  • Programmer should not waste time and talent
    handling fault-tolerance

8
Solution
  • Transparent Checkpointing
  • Use the compiler to efficiently place checkpoints
  • Low failure-free execution overhead
  • Stagger checkpoints
  • Minimize checkpoint state
  • Support legacy code (MPI)

9
Talk Outline
  • Motivation
  • Our Solution
  • Static version
  • Dynamic version
  • Future Work
  • Related Work
  • Conclusion

10
How do we do it?
(Diagram: a message between two processes over time; the sender's vector clock is (1,0,0) and the receiver's becomes (1,1,0) on delivery) [Lamport 78]
11
The Intuition
  • Fault-tolerance requires valid recovery lines
  • Many possible valid recovery lines
  • Which ones are best?
  • Flexibility is key

12
Our Solution
  • Which checkpoints are legal?
  • Determine communication pattern statically
  • Where are the recovery lines?
  • Use vector clocks
  • Which recovery line should we use?
  • Heuristic needed

13
Initial Assumptions
  • MPI (explicit communication)
  • Number of nodes at compile-time
  • Deterministic communication pattern
  • (We will talk about AMR soon)

14
First Step: Communication Pattern
  • The communication pattern can be found statically:

      p = sqrt(no_nodes)
      cell_coord[0][0] = node mod p
      cell_coord[1][0] = node / p
      j = cell_coord[0][0] - 1
      i = cell_coord[1][0] - 1
      from_process = ((i - 1 + p) mod p) * p + j
      MPI_Irecv(x, x, x, from_process, ...)

  • (taken from NAS benchmark bt)
  • Substituting through, the receive source collapses to:

      from_process = ((node / sqrt(no_nodes) - 1 - 1 + sqrt(no_nodes))
                       mod sqrt(no_nodes)) * sqrt(no_nodes)
                       + node mod sqrt(no_nodes) - 1
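To check the arithmetic, the neighbor computation can be sketched in Python. This is an illustration, not the authors' tool: `isqrt` stands in for the benchmark's square root of a perfect-square node count, and operator placement is my reading where the transcript lost symbols.

```python
from math import isqrt

def from_process(node, no_nodes):
    # Step-by-step form, following the slide's variables
    p = isqrt(no_nodes)              # process grid is p x p
    j = node % p - 1                 # cell_coord[0][0] - 1
    i = node // p - 1                # cell_coord[1][0] - 1
    return ((i - 1 + p) % p) * p + j

def from_process_closed(node, no_nodes):
    # The closed form obtained by substituting j and i through
    p = isqrt(no_nodes)
    return ((node // p - 1 - 1 + p) % p) * p + node % p - 1

# The two forms agree on every rank of a 16-node (4x4) run:
assert all(from_process(n, 16) == from_process_closed(n, 16)
           for n in range(16))
```

Because the whole computation depends only on `node` and `no_nodes`, the compiler can evaluate it per rank and recover the full message graph before the program ever runs.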
15
Example: Communication Pattern
(Diagram: the derived message pattern drawn over processes and time)
16
Second Step: Vector Clocks
  • Use communication pattern
  • Instantiate each process
  • Determine vector clocks
  • Discover possible recovery lines
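The vector-clock bookkeeping used here can be sketched in a few lines of Python (a minimal Lamport-style illustration, not the authors' implementation); the values match the (1,0,0) → (1,1,0) example on slide 10.

```python
class Process:
    """Minimal vector-clock bookkeeping for one process."""
    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n

    def tick(self):
        # Local event: advance this process's own component
        self.clock[self.pid] += 1

    def send(self):
        # A send is an event; the message carries the timestamp
        self.tick()
        return list(self.clock)

    def recv(self, msg_clock):
        # Merge component-wise, then count the receive as an event
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.tick()

# P0 sends to P1, as in the earlier diagram:
p0, p1 = Process(0, 3), Process(1, 3)
m = p0.send()        # p0's clock becomes [1, 0, 0]
p1.recv(m)           # p1's clock becomes [1, 1, 0]
assert p0.clock == [1, 0, 0] and p1.clock == [1, 1, 0]
```

Once every send/receive in the statically derived pattern has been replayed this way, each candidate checkpoint location has a vector timestamp, which is what the recovery-line step consumes.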

17
Example: Vector Clocks
(Diagram: vector clocks annotated on the processes-over-time message diagram)
18
Final Step: Recovery Lines [Randell 75]
  • Determining the optimal recovery line is NP-complete [Li 94]
  • Develop heuristic
  • Rough performance model for staggering
  • Varies by architecture and application
  • Uses experimentation and math
  • Goals
  • Valid recovery line
  • Reduce bandwidth contention
  • Reduce storage space
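The validity goal and the staggering goal can both be made concrete with a small sketch (my own illustration, not the authors' heuristic). A set of candidate checkpoints, one per process, forms a consistent recovery line exactly when no process has observed another process's post-checkpoint events.

```python
def is_valid_recovery_line(clocks):
    """clocks[i] = vector clock at process i's candidate checkpoint.
    Consistent iff no process j has seen an event of process i that
    happens after i's own checkpoint (no orphan messages)."""
    n = len(clocks)
    return all(clocks[j][i] <= clocks[i][i]
               for i in range(n) for j in range(n))

def stagger_score(times):
    """Toy stand-in for the performance model: prefer lines whose
    checkpoints are spread out in time, reducing bandwidth contention
    on stable storage (higher minimum gap = better staggering)."""
    ts = sorted(times)
    return min(b - a for a, b in zip(ts, ts[1:])) if len(ts) > 1 else 0

# A message crossing the line makes it invalid:
assert is_valid_recovery_line([[1, 0], [1, 1]])       # consistent
assert not is_valid_recovery_line([[0, 0], [1, 1]])   # orphan message
```

A heuristic along the lines described above would enumerate valid lines and pick one scoring well on staggering and on the size of the state to be saved.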

19
Example: Recovery Lines
(Diagram: a selected recovery line drawn across the processes-over-time diagram)
20
Overview: Status
  • Communication Pattern
  • Vector Clocks
  • Identification of possible recovery lines
  • Select recovery line
  • Experimentation
  • Performance model and heuristic

21
The Next Step: Dynamic Communication
  • Use static analysis to identify possible recovery
    lines
  • Choose among them at runtime

22
How can we do this?
  • Use static analysis to set up the dynamic
    processing necessary to place sets of checkpoints
    on recovery lines
  • Static analysis at link time
  • Dynamic analysis in software at runtime

23
Dynamic Analysis
  • To determine recovery line, each process needs to
    know
  • Neighbor set
  • Presence of phase changes
  • When others are checkpointing (optimization)
  • Use Inspectors/Executors

24
Static Analysis (Modified)
  • Find any deterministic communication
  • Identify dynamic communication phases
  • Propagate calculations for dynamic communication
  • Identify potential checkpoint locations

25
Inspectors [Saltz 94]
  • What are they?
  • Data analyzers that run during execution
  • Inserted by the programmer into the code where
    data needs to be analyzed
  • How can we use them?
  • Insert wherever the communication pattern may
    change
  • Analyze the new communication pattern to collect
    information about potential checkpoint locations

26
Executors [Saltz 94]
  • What are they?
  • Influenced by the inspector
  • Perform select actions at runtime
  • How can we use them?
  • Place at potential checkpoint locations
  • Checkpoint
  • Based on the inspector's decisions
  • Based on network utilization
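The inspector/executor split described on these two slides can be sketched as follows. This is a hypothetical illustration: the class names, the `comm_pattern` shape, and the "no in-flight message" test are stand-ins for whatever analysis the real system performs.

```python
class Inspector:
    """Runs wherever the communication pattern may change; records
    which statically placed checkpoint locations are safe to use."""
    def __init__(self):
        self.good_locations = set()

    def analyze(self, comm_pattern):
        # Placeholder analysis: approve locations with no message
        # in flight at that point (a proxy for 'on a recovery line').
        for loc, in_flight in comm_pattern.items():
            if not in_flight:
                self.good_locations.add(loc)

class Executor:
    """Placed at each potential checkpoint location; checkpoints only
    where the inspector approved."""
    def __init__(self, inspector):
        self.inspector = inspector

    def maybe_checkpoint(self, loc):
        return loc in self.inspector.good_locations

insp = Inspector()
insp.analyze({"loop_head": [], "mid_exchange": ["msg from P3"]})
ex = Executor(insp)
assert ex.maybe_checkpoint("loop_head")
assert not ex.maybe_checkpoint("mid_exchange")
```

The point of the split is that the (possibly expensive) analysis runs only when the pattern changes, while the executors at each location stay cheap.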

27
Checkpoint Locations
  • Checkpoint()
  • Checkpoint()
  • Checkpoint()
  • Checkpoint()
  • Checkpoint()
  • Potential checkpoint locations are placed
    statically
  • Inspector gathers information
  • Executors choose where to checkpoint

28
Decision Points
  • Static analysis at link time
  • Allows easy analysis across compilation units
  • Communication often done in libraries
  • Large scope
  • Expensive analyses more forgivable during
    compilation/linking
  • Dynamic decisions
  • Support dynamic communication patterns

29
Future Work
  • Better static algorithm (less brute force)
  • Support dynamically changing communication
    patterns
  • Symbolic reasoning
  • Checkpoint mechanism
  • High-level annotations to capture programmer
    knowledge
  • Rollback and recovery

30
Related Work
  • Checkpointing
  • When?
  • Coordinated
  • Uncoordinated
  • What?
  • Compiler-Assisted [Plank 95]
  • How?
  • Application-Level Non-Blocking [Bronevetsky 2003]
  • Message Logging
  • Egida [Rao 99]

31
Conclusions
  • Checkpointing efficiently is becoming harder
  • We have developed a framework to place
    checkpoints correctly in the program
  • Our framework should reduce failure-free
    execution overhead by
  • Staggering checkpoints across the cluster
  • Placing checkpoints carefully in the program to
    reduce state

32
  • Questions?

33
(No Transcript)
34
Fault Model
35
Vector Clock Formula
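The formula itself did not survive the transcript; for reference, the standard vector-clock update rules (in the Fidge/Mattern form, which the analysis in this talk relies on) are:

```latex
% Standard vector-clock update rules; the slide's own formula was lost,
% so this is the textbook form, not necessarily the slide's notation.
\begin{align*}
  \text{local event at } p_i:\quad & V_i[i] \leftarrow V_i[i] + 1 \\
  \text{send } m \text{ from } p_i:\quad & V_i[i] \leftarrow V_i[i] + 1,\quad m.ts \leftarrow V_i \\
  \text{receive } m \text{ at } p_j:\quad & V_j \leftarrow \max(V_j, m.ts),\quad V_j[j] \leftarrow V_j[j] + 1
\end{align*}
```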
36
Message Logging
  • Saves all messages sent to stable storage
  • In the future, storing this data will be
    untenable
  • Message logging relies on checkpointing so that
    logs can be cleared
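The dependency between logging and checkpointing can be shown with a toy model (an illustration only; `MessageLog` and its methods are invented names):

```python
from collections import deque

class MessageLog:
    """Toy model: every sent message goes to 'stable storage', and the
    log is truncated once a checkpoint makes replay of older messages
    unnecessary."""
    def __init__(self):
        self.log = deque()

    def on_send(self, msg):
        self.log.append(msg)     # persist before delivery

    def on_checkpoint(self):
        self.log.clear()         # log cleared at each checkpoint

log = MessageLog()
log.on_send("m1"); log.on_send("m2")
assert len(log.log) == 2
log.on_checkpoint()
assert len(log.log) == 0
```

Without periodic checkpoints the log grows without bound, which is exactly why message logging still needs the checkpoint placement discussed in this talk.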

37
In-flight messages: Why we don't care
  • We reason about them at the application level, so:
  • Messages are assumed received at actual receive
    call or at wait
  • We will know if any messages crossed the recovery
    line. We can prepare for recovery by
    checkpointing that information.

38
C-Breeze
  • In-house compiler
  • Allows us to reason about code at various phases
    of compilation
  • Allows us to add our own phases