Compiler-Assisted Checkpointing for MPI Programs - PowerPoint PPT Presentation

1
Compiler-Assisted Checkpointing for MPI Programs
  • Alison N. Smith and Calvin Lin
  • The University of Texas at Austin
  • Department of Computer Sciences

2
Introduction
  • High Performance Computing is essential to
    science and engineering
  • Clusters are used for HPC
  • Peak performance scales linearly with the number
    of nodes
  • Reliability decreases with the number of nodes
  • 1 node fails every 10,000 hours
  • 6,000 nodes have a failure every 1.6 hours
  • 64,000 nodes have a failure every 5 minutes (?)

3
Current Solutions
  • Fault tolerance requires redundancy
  • Redundancy in space
  • Each participating process has a backup process
  • Expensive!
  • Redundancy in time
  • Processes save state and then rollback for
    recovery
  • Allows cheaper fault tolerance

4
Today's Answer
  • Programmers place checkpoints
  • Coordinated
  • Utilizes programmer expertise about the application
  • Where is the problem?
  • Future systems!

5
What is the problem?
  • Today

(Diagram: processes over time, with checkpoints marked by X)
6
Decisions, Decisions
  • Checkpoints must be placed carefully
  • Correct
  • Beneficial
  • Cluster architecture
  • Application state

7
In the future
  • Systems will be more complex
  • Programs will be more complex
  • Checkpointing will be more complex
  • Programmer should not waste time and talent
    handling fault-tolerance

8
Solution
  • Transparent Checkpointing
  • Use the compiler to efficiently place checkpoints
  • Low failure-free execution overhead
  • Stagger checkpoints
  • Minimize checkpoint state
  • Support legacy code (MPI)

9
Talk Outline
  • Motivation
  • Our Solution
  • Static version
  • Dynamic version
  • Future Work
  • Related Work
  • Conclusion

10
How do we do it?
(Diagram: a message between two processes over time; the sender's vector clock is (1,0,0) and the receiver's becomes (1,1,0) on delivery) [Lamport 78]
11
The Intuition
  • Fault-tolerance requires valid recovery lines
  • Many possible valid recovery lines
  • Which ones are best?
  • Flexibility is key

12
Our Solution
  • Which checkpoints are legal?
  • Determine communication pattern statically
  • Where are the recovery lines?
  • Use vector clocks
  • Which recovery line should we use?
  • Heuristic needed

13
Initial Assumptions
  • MPI (explicit communication)
  • Number of nodes at compile-time
  • Deterministic communication pattern
  • (We will talk about AMR soon)

14
First Step: Communication Pattern
  • The communication pattern can be found statically:

      p = sqrt(no_nodes)
      cell_coord[0][0] = node mod p
      cell_coord[1][0] = node / p
      j = cell_coord[0][0] - 1
      i = cell_coord[1][0] - 1
      from_process = ((i - 1 + p) mod p) * p + j
      MPI_Irecv(x, x, x, from_process, ...)

  • (taken from NAS benchmark bt)
  • Substituting through, the receive source collapses to:

      from_process = ((node / sqrt(no_nodes) - 1 - 1 + sqrt(no_nodes))
                       mod sqrt(no_nodes)) * sqrt(no_nodes)
                       + node mod sqrt(no_nodes) - 1
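To check the arithmetic, the neighbor computation can be sketched in Python. This is an illustration, not the authors' tool: `isqrt` stands in for the benchmark's square root of a perfect-square node count, and operator placement is my reading where the transcript lost symbols.

```python
from math import isqrt

def from_process(node, no_nodes):
    # Step-by-step form, following the slide's variables
    p = isqrt(no_nodes)              # process grid is p x p
    j = node % p - 1                 # cell_coord[0][0] - 1
    i = node // p - 1                # cell_coord[1][0] - 1
    return ((i - 1 + p) % p) * p + j

def from_process_closed(node, no_nodes):
    # The closed form obtained by substituting j and i through
    p = isqrt(no_nodes)
    return ((node // p - 1 - 1 + p) % p) * p + node % p - 1

# The two forms agree on every rank of a 16-node (4x4) run:
assert all(from_process(n, 16) == from_process_closed(n, 16)
           for n in range(16))
```

Because the whole computation depends only on `node` and `no_nodes`, the compiler can evaluate it per rank and recover the full message graph before the program ever runs.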
15
Example: Communication Pattern
(Diagram: the derived message pattern drawn over processes and time)
16
Second Step: Vector Clocks
  • Use communication pattern
  • Instantiate each process
  • Determine vector clocks
  • Discover possible recovery lines
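The vector-clock bookkeeping used here can be sketched in a few lines of Python (a minimal Lamport-style illustration, not the authors' implementation); the values match the (1,0,0) → (1,1,0) example on slide 10.

```python
class Process:
    """Minimal vector-clock bookkeeping for one process."""
    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n

    def tick(self):
        # Local event: advance this process's own component
        self.clock[self.pid] += 1

    def send(self):
        # A send is an event; the message carries the timestamp
        self.tick()
        return list(self.clock)

    def recv(self, msg_clock):
        # Merge component-wise, then count the receive as an event
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.tick()

# P0 sends to P1, as in the earlier diagram:
p0, p1 = Process(0, 3), Process(1, 3)
m = p0.send()        # p0's clock becomes [1, 0, 0]
p1.recv(m)           # p1's clock becomes [1, 1, 0]
assert p0.clock == [1, 0, 0] and p1.clock == [1, 1, 0]
```

Once every send/receive in the statically derived pattern has been replayed this way, each candidate checkpoint location has a vector timestamp, which is what the recovery-line step consumes.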

17
Example: Vector Clocks
(Diagram: vector clocks annotated on the processes-over-time message diagram)
18
Final Step: Recovery Lines [Randell 75]
  • Determining the optimal recovery line is NP-complete [Li 94]
  • Develop heuristic
  • Rough performance model for staggering
  • Varies by architecture and application
  • Uses experimentation and math
  • Goals
  • Valid recovery line
  • Reduce bandwidth contention
  • Reduce storage space
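The validity goal and the staggering goal can both be made concrete with a small sketch (my own illustration, not the authors' heuristic). A set of candidate checkpoints, one per process, forms a consistent recovery line exactly when no process has observed another process's post-checkpoint events.

```python
def is_valid_recovery_line(clocks):
    """clocks[i] = vector clock at process i's candidate checkpoint.
    Consistent iff no process j has seen an event of process i that
    happens after i's own checkpoint (no orphan messages)."""
    n = len(clocks)
    return all(clocks[j][i] <= clocks[i][i]
               for i in range(n) for j in range(n))

def stagger_score(times):
    """Toy stand-in for the performance model: prefer lines whose
    checkpoints are spread out in time, reducing bandwidth contention
    on stable storage (higher minimum gap = better staggering)."""
    ts = sorted(times)
    return min(b - a for a, b in zip(ts, ts[1:])) if len(ts) > 1 else 0

# A message crossing the line makes it invalid:
assert is_valid_recovery_line([[1, 0], [1, 1]])       # consistent
assert not is_valid_recovery_line([[0, 0], [1, 1]])   # orphan message
```

A heuristic along the lines described above would enumerate valid lines and pick one scoring well on staggering and on the size of the state to be saved.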

19
Example: Recovery Lines
(Diagram: a selected recovery line drawn across the processes-over-time diagram)
20
Overview: Status
  • Communication Pattern
  • Vector Clocks
  • Identification of possible recovery lines
  • Select recovery line
  • Experimentation
  • Performance model and heuristic

21
The Next Step: Dynamic Communication
  • Use static analysis to identify possible recovery
    lines
  • Choose among them at runtime

22
How can we do this?
  • Use static analysis to set up the dynamic
    processing necessary to place sets of checkpoints
    on recovery lines
  • Static analysis at link time
  • Dynamic analysis in software at runtime

23
Dynamic Analysis
  • To determine recovery line, each process needs to
    know
  • Neighbor set
  • Presence of phase changes
  • When others are checkpointing (optimization)
  • Use Inspectors/Executors

24
Static Analysis (Modified)
  • Find any deterministic communication
  • Identify dynamic communication phases
  • Propagate calculations for dynamic communication
  • Identify potential checkpoint locations

25
Inspectors [Saltz 94]
  • What are they?
  • Data analyzers that run during execution
  • Inserted by the programmer into the code where
    data needs to be analyzed
  • How can we use them?
  • Insert wherever the communication pattern may
    change
  • Analyze the new communication pattern to collect
    information about potential checkpoint locations

26
Executors [Saltz 94]
  • What are they?
  • Influenced by the inspector
  • Perform select actions at runtime
  • How can we use them?
  • Place at potential checkpoint locations
  • Checkpoint
  • Based on the inspector's decisions
  • Based on network utilization
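The inspector/executor split described on these two slides can be sketched as follows. This is a hypothetical illustration: the class names, the `comm_pattern` shape, and the "no in-flight message" test are stand-ins for whatever analysis the real system performs.

```python
class Inspector:
    """Runs wherever the communication pattern may change; records
    which statically placed checkpoint locations are safe to use."""
    def __init__(self):
        self.good_locations = set()

    def analyze(self, comm_pattern):
        # Placeholder analysis: approve locations with no message
        # in flight at that point (a proxy for 'on a recovery line').
        for loc, in_flight in comm_pattern.items():
            if not in_flight:
                self.good_locations.add(loc)

class Executor:
    """Placed at each potential checkpoint location; checkpoints only
    where the inspector approved."""
    def __init__(self, inspector):
        self.inspector = inspector

    def maybe_checkpoint(self, loc):
        return loc in self.inspector.good_locations

insp = Inspector()
insp.analyze({"loop_head": [], "mid_exchange": ["msg from P3"]})
ex = Executor(insp)
assert ex.maybe_checkpoint("loop_head")
assert not ex.maybe_checkpoint("mid_exchange")
```

The point of the split is that the (possibly expensive) analysis runs only when the pattern changes, while the executors at each location stay cheap.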

27
Checkpoint Locations
  • Checkpoint()
  • Checkpoint()
  • Checkpoint()
  • Checkpoint()
  • Checkpoint()
  • Potential checkpoint locations are placed
    statically
  • Inspector gathers information
  • Executors choose where to checkpoint

28
Decision Points
  • Static analysis at link time
  • Allows easy analysis across compilation units
  • Communication often done in libraries
  • Large scope
  • Expensive analyses more forgivable during
    compilation/linking
  • Dynamic decisions
  • Support dynamic communication patterns

29
Future Work
  • Better static algorithm (less brute force)
  • Support dynamically changing communication
    patterns
  • Symbolic reasoning
  • Checkpoint mechanism
  • High-level annotations to capture programmer
    knowledge
  • Rollback and recovery

30
Related Work
  • Checkpointing
  • When?
  • Coordinated
  • Uncoordinated
  • What?
  • Compiler-Assisted [Plank 95]
  • How?
  • Application-Level Non-Blocking [Bronevetsky 2003]
  • Message Logging
  • Egida [Rao 99]

31
Conclusions
  • Checkpointing efficiently is becoming harder
  • We have developed a framework to place
    checkpoints correctly in the program
  • Our framework should reduce failure-free
    execution overhead by
  • Staggering checkpoints across the cluster
  • Placing checkpoints carefully in the program to
    reduce state

32
  • Questions?

33
(No Transcript)
34
Fault Model
35
Vector Clock Formula
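The formula itself did not survive the transcript; for reference, the standard vector-clock update rules (in the Fidge/Mattern form, which the analysis in this talk relies on) are:

```latex
% Standard vector-clock update rules; the slide's own formula was lost,
% so this is the textbook form, not necessarily the slide's notation.
\begin{align*}
  \text{local event at } p_i:\quad & V_i[i] \leftarrow V_i[i] + 1 \\
  \text{send } m \text{ from } p_i:\quad & V_i[i] \leftarrow V_i[i] + 1,\quad m.ts \leftarrow V_i \\
  \text{receive } m \text{ at } p_j:\quad & V_j \leftarrow \max(V_j, m.ts),\quad V_j[j] \leftarrow V_j[j] + 1
\end{align*}
```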
36
Message Logging
  • Saves all messages sent to stable storage
  • In the future, storing this data will be
    untenable
  • Message logging relies on checkpointing so that
    logs can be cleared
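The dependency between logging and checkpointing can be shown with a toy model (an illustration only; `MessageLog` and its methods are invented names):

```python
from collections import deque

class MessageLog:
    """Toy model: every sent message goes to 'stable storage', and the
    log is truncated once a checkpoint makes replay of older messages
    unnecessary."""
    def __init__(self):
        self.log = deque()

    def on_send(self, msg):
        self.log.append(msg)     # persist before delivery

    def on_checkpoint(self):
        self.log.clear()         # log cleared at each checkpoint

log = MessageLog()
log.on_send("m1"); log.on_send("m2")
assert len(log.log) == 2
log.on_checkpoint()
assert len(log.log) == 0
```

Without periodic checkpoints the log grows without bound, which is exactly why message logging still needs the checkpoint placement discussed in this talk.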

37
In-flight messages: Why we don't care
  • We reason about them at the application level, so:
  • Messages are assumed received at actual receive
    call or at wait
  • We will know if any messages crossed the recovery
    line. We can prepare for recovery by
    checkpointing that information.

38
C-Breeze
  • In-house compiler
  • Allows us to reason about code at various phases
    of compilation
  • Allows us to add our own phases