Title: Consistent Cuts and Un-coordinated Check-pointing
1 Consistent Cuts and Un-coordinated Check-pointing
2 Cuts
[Figure: space-time diagram of the example computation, events e0 to e13]
- Subset C of events in computation
  - some definitions require at least one event from each process
- For each process P, events in C that executed on P form an initial prefix of all events that executed on P
- Cut: {e0, e1, e2, e4, e7}. Not a cut: {e0, e2, e4, e7}
- Frontier of cut: subset of cut containing last events on each process
  - for our example, {e2, e4, e7}
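The prefix condition and the frontier can be checked mechanically. The following is a minimal sketch, not part of the original slides: the names `is_cut`, `frontier` and `example_computation` are hypothetical, and the assignment of events to three processes is an assumed layout chosen only to agree with the cut examples above.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using Event = std::string;
// comp[p] lists the events of process p in execution order.
using Computation = std::vector<std::vector<Event>>;

// A cut must contain an initial prefix of the events of every process.
bool is_cut(const Computation& comp, const std::set<Event>& c) {
    for (const auto& proc : comp) {
        bool missing = false;  // set once an event of this process is not in c
        for (const auto& e : proc) {
            if (c.count(e) == 0) {
                missing = true;
            } else if (missing) {
                return false;  // e is in c but an earlier event is not
            }
        }
    }
    return true;
}

// Frontier: the last event of the cut on each process that has one.
std::set<Event> frontier(const Computation& comp, const std::set<Event>& c) {
    assert(is_cut(comp, c));
    std::set<Event> f;
    for (const auto& proc : comp) {
        const Event* last = nullptr;
        for (const auto& e : proc) {
            if (c.count(e) != 0) last = &e;
        }
        if (last != nullptr) f.insert(*last);
    }
    return f;
}

// Assumed layout (hypothetical): three processes.
Computation example_computation() {
    return {{"e0", "e1", "e2", "e3"},
            {"e4", "e5", "e6"},
            {"e7", "e8", "e9", "e10"}};
}
```

Under this assumed layout, `is_cut` accepts {e0, e1, e2, e4, e7}, rejects {e0, e2, e4, e7} because e1 is skipped, and `frontier` of the first set is {e2, e4, e7}.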
3 Equivalent definition of cut
[Figure: space-time diagram of the example computation, events e0 to e13]
- Subset C of events in computation
- If e' ∈ C, and e → e', and e and e' executed on the same process, then e ∈ C.
- What happens if we remove the condition that e and e' were executed on the same process?
4 Consistent cut
[Figure: space-time diagram of the example computation, events e0 to e13]
- Subset C of events in computation
- If e' ∈ C, and e → e', then e ∈ C
- Consistent cut: {e0, e1, e2, e4, e5, e7}
  - note e5 → e2, but cut is still consistent by our definition
- Inconsistent cut: {e0, e1, e2, e4, e7}
- Not a cut: {e0, e2, e4, e7}
5 Properties of consistent cuts (0)
[Figure: space-time diagram of the example computation, with events e and e' marked]
- If cut is inconsistent, there must be a message such that receiving event is in C but sending event is not.
- Proof: there must be an e and e' such that e → e', e' in C but e not in C. Consider the chain e → e0 → e1 → ... → e'. There must be events ei → ej in this chain such that events e, e0, ..., ei are not in C, but ej is in C. Clearly, ei and ej must be executed by different processes (otherwise C would not contain an initial prefix of that process's events). Therefore, ei is send and ej is receive.
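Property (0) suggests a cheap consistency test: a cut is consistent exactly when no message has its receive inside the cut and its send outside. The sketch below is illustrative only. The names are hypothetical, and the event layout and the single message from e5 to e2 are assumptions chosen to agree with the cut examples on the earlier slides.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Event = std::string;
using Computation = std::vector<std::vector<Event>>;  // per-process event order
using Message = std::pair<Event, Event>;              // (send event, receive event)

// Events of every process that belong to c must form an initial prefix.
bool is_cut(const Computation& comp, const std::set<Event>& c) {
    for (const auto& proc : comp) {
        bool missing = false;
        for (const auto& e : proc) {
            if (c.count(e) == 0) {
                missing = true;
            } else if (missing) {
                return false;
            }
        }
    }
    return true;
}

// Consistent: a cut in which every received message was also sent.
bool is_consistent(const Computation& comp, const std::vector<Message>& msgs,
                   const std::set<Event>& c) {
    if (!is_cut(comp, c)) return false;
    for (const Message& m : msgs) {
        const bool send_in = c.count(m.first) != 0;
        const bool recv_in = c.count(m.second) != 0;
        if (recv_in && !send_in) return false;  // orphan receive
    }
    return true;
}
```

With the assumed message (e5, e2), {e0, e1, e2, e4, e5, e7} passes, while {e0, e1, e2, e4, e7} fails because it contains the receive e2 without the send e5.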
6 Properties of consistent cuts (I)
[Figure: space-time diagram of the example computation]
- Let eP be a computational event on the frontier of a consistent cut C. If eP → eQ, then eQ cannot be in C.
- Proof: Consider the causal chain eP → e1 → ... → eQ.
  - Event e1 must execute on process P because eP is a computational event. Since eP is on the frontier, e1 is not in C. By definition of consistent cut, eQ cannot be in C.
7 Properties (II)
[Figure: space-time diagram of the example computation]
- Let F = {e0, e1, ...} be a set of computational events, one from each process. F is the frontier of a consistent cut iff the events in F are concurrent.
- Proof: from Property (I) and Property (0).
8 Properties of consistent cuts (III): Lattice of consistent cuts
[Figure: space-time diagram of the example computation showing two consistent cuts, C1 and C2]
9 Un-coordinated check-pointing
- Each process saves its local state at start, and then whenever it wants.
- Events: compute, send, receive, take check-point
- Recovery line: frontier of any consistent cut whose events are all check-points
- Is there an optimum recovery line? How do we find it?
10 Check-point Dependency Graph
[Figure: check-point dependency graph for processes p, q, r]
- Nodes
  - One for each local check-point
  - One for current state of each surviving process
- Edges: one for each message (e, e') from some P to Q
  - Source is node for last check-point on P that happened before e
  - Destination is node n on Q for first check-point/current state such that e' happened before n
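The edge rule can be sketched as two binary searches per message. Everything here is an illustrative assumption rather than the slides' own representation: events carry integer local times, `ckpt[p]` lists the times of the check-points of process p with the initial check-point at time 0, and the names `Node`, `Edge`, `Msg` and `build_edges` are hypothetical. A receive after the last check-point maps to the current-state node, so the sketch covers surviving receivers only.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Node { int proc; int idx; };  // idx-th node of process proc
struct Edge { Node src; Node dst; };
struct Msg  { int from; int send_time; int to; int recv_time; };

// ckpt[p] holds the local times of the check-points of process p in
// increasing order; ckpt[p][0] == 0 is the check-point taken at start.
// Node (p, i) with i < ckpt[p].size() is check-point i; node
// (p, ckpt[p].size()) stands for the current state of a surviving process.
std::vector<Edge> build_edges(const std::vector<std::vector<int>>& ckpt,
                              const std::vector<Msg>& msgs) {
    std::vector<Edge> edges;
    for (const Msg& m : msgs) {
        const std::vector<int>& s = ckpt[m.from];
        const std::vector<int>& r = ckpt[m.to];
        // Source: last check-point of the sender taken before the send.
        auto not_before = std::lower_bound(s.begin(), s.end(), m.send_time);
        assert(not_before != s.begin());  // sends happen after time 0
        const int src = static_cast<int>(not_before - s.begin()) - 1;
        // Destination: first check-point of the receiver taken after the
        // receive; equals r.size() (current state) if there is none.
        auto after = std::upper_bound(r.begin(), r.end(), m.recv_time);
        const int dst = static_cast<int>(after - r.begin());
        edges.push_back({{m.from, src}, {m.to, dst}});
    }
    return edges;
}
```

For example, with check-points at times {0, 5} on process 0 and {0, 3} on process 1, a message sent by process 0 at time 2 and received by process 1 at time 4 yields an edge from node (0, 0) to the current-state node (1, 2).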
11 Properties of check-point dependency graph
[Figure: check-point dependency graph for processes p, q, r]
- Node c2 is reachable from node c1 in graph iff check-point corresponding to c1 happens before check-point corresponding to c2.
12 Finding optimum recovery line
[Figure: check-point dependency graph for processes p, q, r with successive recovery lines RL0, RL1, RL2, RL3]
- RL0 = last nodes on each process
- While (there exist u, v in RLi such that v is reachable from u)
  - RLi+1 = (RLi \ {v}) ∪ {node before v in same process as v}
- Final RL when loop terminates is optimum recovery line
- See later to make this into an algorithm.
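The loop above can be sketched directly, without regard to efficiency. This is a hedged illustration, not the slides' algorithm text: the graph encoding (`count[p]` nodes per process, numbered from the initial check-point 0) and the names `reachable` and `recovery_line_naive` are hypothetical, and the sketch assumes that consecutive nodes of one process are also connected, as the reachability property of the dependency graph requires.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node { int proc; int idx; };
struct Edge { Node src; Node dst; };

// Depth-first search over message edges plus the implicit same-process
// successor edges (p, i) -> (p, i + 1). count[p] = number of nodes of p.
bool reachable(const std::vector<int>& count, const std::vector<Edge>& edges,
               Node from, Node to) {
    std::vector<std::vector<bool>> seen(count.size());
    for (std::size_t p = 0; p < count.size(); ++p) seen[p].assign(count[p], false);
    std::vector<Node> stack;
    stack.push_back(from);
    seen[from.proc][from.idx] = true;
    while (!stack.empty()) {
        Node n = stack.back();
        stack.pop_back();
        if (n.proc == to.proc && n.idx == to.idx) return true;
        std::vector<Node> next;
        if (n.idx + 1 < count[n.proc]) next.push_back({n.proc, n.idx + 1});
        for (const Edge& e : edges)
            if (e.src.proc == n.proc && e.src.idx == n.idx) next.push_back(e.dst);
        for (Node m : next) {
            if (!seen[m.proc][m.idx]) {
                seen[m.proc][m.idx] = true;
                stack.push_back(m);
            }
        }
    }
    return false;
}

// Returns, for each process, the index of its node on the recovery line.
std::vector<int> recovery_line_naive(const std::vector<int>& count,
                                     const std::vector<Edge>& edges) {
    const int n = static_cast<int>(count.size());
    std::vector<int> rl(n);
    for (int p = 0; p < n; ++p) rl[p] = count[p] - 1;  // RL0: last nodes
    bool changed = true;
    while (changed) {
        changed = false;
        for (int p = 0; p < n; ++p) {
            for (int q = 0; q < n; ++q) {
                if (p == q) continue;
                if (reachable(count, edges, {p, rl[p]}, {q, rl[q]})) {
                    assert(rl[q] > 0);  // initial check-points have no incoming edges
                    --rl[q];            // replace v by the node before it
                    changed = true;
                }
            }
        }
    }
    return rl;
}
```

As a small example, take a failed process 0 with check-points 0 and 1, a surviving process 1 with check-points 0 and 1 plus its current state 2, and edges (0,1) to (1,2) and (1,1) to (0,1). The sketch rolls process 1 back to check-point 1 and then process 0 back to check-point 0.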
13 Correctness
[Figure: check-point dependency graph for processes p, q, r]
- Algorithm obviously computes a set of concurrent check-points, one from each process.
- From Property (II), it follows that these check-points are the frontier of a consistent cut.
14 Optimality
[Figure: check-point dependency graph for processes p, q, r]
- Suppose O is a better recovery line.
- O cannot be RL0; otherwise, our algorithm succeeds. So RL0 is better than O.
- Consider the iteration when RLi is better than O but RLi+1 is not. There exist u, v in RLi such that v is reachable from u and RLi+1 is obtained from RLi by dropping v and taking the check-point prior to v. Therefore, v must be in O. Let x in O be the check-point on the same process as u. Since x is at or before u, we see that x → v, which contradicts Property (II).
15 Finding recovery line efficiently
[Figure: check-point dependency graph for processes p, q, r]
- Node colors
  - Yellow: on current recovery line
  - Red: beyond current recovery line
  - Green: behind current recovery line
- Bad edge
  - Source is red/yellow
  - Destination is yellow/green
- Algorithm: propagate redness forward from destinations of bad edges
16 Algorithm
- Mark all nodes green
- For each node l that is last node of a process
  - Mark node yellow
  - Add each edge (l, d) to worklist
- While worklist is nonempty do
  - Get edge (s, d) from worklist
  - If color(d) is red, continue
  - L = node to left of d
  - Mark L yellow; add all bad edges (L, d') to worklist
  - R = first red node to right of d
  - For each node t in interval [d, R)
    - Mark t red
    - Add all bad edges of form (t, d') to worklist
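A minimal sketch of the worklist algorithm follows. The encoding is an assumption made for illustration: nodes of process p are numbered 0 to count[p] - 1 from left to right, node 0 is the initial check-point and has no incoming edges, and the names `Color`, `Node`, `Edge` and `recovery_line_worklist` are hypothetical.

```cpp
#include <cassert>
#include <vector>

enum class Color { Green, Yellow, Red };
struct Node { int proc; int idx; };
struct Edge { Node src; Node dst; };

// Returns, for each process, the index of its yellow node (the recovery line).
std::vector<int> recovery_line_worklist(const std::vector<int>& count,
                                        const std::vector<Edge>& edges) {
    const int n = static_cast<int>(count.size());
    std::vector<std::vector<Color>> color(n);
    std::vector<std::vector<std::vector<Node>>> out(n);  // out[p][i]: edge targets
    for (int p = 0; p < n; ++p) {
        color[p].assign(count[p], Color::Green);
        out[p].resize(count[p]);
    }
    for (const Edge& e : edges) out[e.src.proc][e.src.idx].push_back(e.dst);

    std::vector<Edge> worklist;
    // Called only for yellow or red sources: an edge is bad if its
    // destination is still yellow or green.
    auto add_bad_edges = [&](Node s) {
        for (Node d : out[s.proc][s.idx])
            if (color[d.proc][d.idx] != Color::Red) worklist.push_back({s, d});
    };

    for (int p = 0; p < n; ++p) {
        Node last{p, count[p] - 1};
        color[p][last.idx] = Color::Yellow;
        add_bad_edges(last);
    }
    while (!worklist.empty()) {
        Node d = worklist.back().dst;
        worklist.pop_back();
        if (color[d.proc][d.idx] == Color::Red) continue;
        assert(d.idx > 0);                      // node 0 has no incoming edges
        Node left{d.proc, d.idx - 1};
        color[left.proc][left.idx] = Color::Yellow;
        add_bad_edges(left);
        // Mark [d, R) red, where R is the first red node to the right of d.
        for (int t = d.idx; t < count[d.proc] && color[d.proc][t] != Color::Red; ++t) {
            color[d.proc][t] = Color::Red;
            add_bad_edges({d.proc, t});
        }
    }
    std::vector<int> rl(n, -1);
    for (int p = 0; p < n; ++p)
        for (int i = 0; i < count[p]; ++i)
            if (color[p][i] == Color::Yellow) rl[p] = i;
    return rl;
}
```

The sketch keeps the invariant that each process reads green nodes, then one yellow node, then red nodes from left to right, so the left neighbour of a non-red destination is always green when it is promoted to yellow.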
17 Remarks
- Complexity of algorithm: O(|E| + |V|)
  - Each node is touched at most 3 times: to mark it green, yellow, red
  - Each edge is examined at most twice
    - Once when its source goes green → yellow
    - Once when its source goes yellow → red
- Another approach: use rollback dependency graph (see Alvisi et al.)
18 Practical details
- Each process numbers its checkpoints starting at 0.
- When a message is sent from S to R, the number of the last check-point is piggybacked on the message.
- Receiver of message saves message + piggyback in log.
- When checkpoint is taken, message log is also saved on disk.
- In-flight messages can be recovered from this log after recovery line has been established.
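The bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions: `Process`, `Message` and their members are hypothetical names, and an in-memory vector stands in for the disk. The piggybacked number identifies the source node of the dependency-graph edge for each logged message.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Message {
    int sender;        // id of the sending process
    int sender_ckpt;   // piggyback: number of sender's last check-point
    std::string payload;
};

struct Process {
    int id;
    int last_ckpt = 0;                             // check-point 0 is taken at start
    std::vector<Message> log;                      // messages received so far
    std::vector<std::vector<Message>> saved_logs;  // stand-in for disk, one per check-point

    explicit Process(int id_) : id(id_) { saved_logs.push_back(log); }

    // Sending piggybacks the number of the last check-point.
    Message send(const std::string& payload) const {
        return Message{id, last_ckpt, payload};
    }
    // The receiver logs the message together with its piggyback.
    void receive(const Message& m) { log.push_back(m); }
    // Taking a check-point also saves the message log.
    void take_checkpoint() {
        ++last_ckpt;
        saved_logs.push_back(log);
    }
};
```

A usage sketch: if process 0 sends before and after taking check-point 1, the receiver's log records piggyback values 0 and 1 for the two messages, and the log saved with the receiver's next check-point contains both.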
19 Garbage collection of saved states
- Garbage collection of old states is a key problem.
- One solution: run the recovery line algorithm once in a while even if there is no failure, and GC all states behind the recovery line.
20 Application-level Check-pointing
21 Recall
- We have seen system-level check-pointing.
- Trouble with system-level check-pointing
  - lot of data saved at each check-point
    - PC, registers, stack, heap, some O/S state, network state, ...
    - thin pipe to disk problem
  - lack of portability
    - processor/OS state is very implementation-specific
    - cannot restart check-point on different platform
    - cannot restart check-point on different number of processors
- One alternative: application-level check-pointing
22 Application-level check-pointing
- Key idea: permit user to specify
  - what variables should be saved at a check-point
  - program point where check-point should be taken
- Example: protein-folding
  - save only positions and velocities of bases
  - check-point at end of time-step
- Advantages
  - less data saved
    - only live data needs to be saved
    - check-point at program points where live data is small and there are no in-flight messages
  - data can be saved in implementation-independent manner
23 Warning
- This is more complex than it appears!
- We must restore
  - PC: need to save where check-point was taken
  - registers
  - stack
    - In general, many active procedure invocations when check-point is taken.
    - How do we restore stack so procedure returns etc. happen correctly?
  - Heap: restored heap data will be in different locations than at check-point
24 Right intuition
- In application-level check-pointing, we must use the saved variables to recompute the system state we would have saved in system-level check-pointing, modulo relocation of heap variables.
- Recovery script
  - code that is executed to accomplish this
  - distinct from user code, but obviously derived from it
  - however, needs to be woven into user code to simplify problems such as register restoration
25 Example: DOME (Beguelin et al., CMU)
- Distributed Object Migration Environment (DOME)
- C++ library of data parallel objects automatically distributed over networks of heterogeneous work-stations
- Application-level check-pointing and restart supported
  - User-level
  - Pre-processor based
26 Simple case
- Most computation occurs in a loop in main
- Solution
  - put one check-point at bottom of loop
  - live variables at bottom of loop are globals
  - write script to save and restore globals
  - weave script into main
27 Dome example
    main(int argc, char *argv[])
    {
        dome_init(argc, argv);
        // statements are introduced for failure recovery
        // prefix d on variable type says "save me at checkpoint"
        dScalar<int> integer_variable;
        dScalar<float> float_variable;
        dVector<int> int_vector;
        if (!is_dome_restarting())
            execute_user_initialization_code();
        while (!loop_done()) {
            // loop_done uses only saved variables
            do_computation();
            dome_check_point();
        }
    }
28 Analysis
- Let us understand how this code restores processor state
  - PC: we drop into loop after restoring globals
  - registers: by making recovery script part of main, we ensure that register contents at top of loop are same for normal execution and for restart
  - stack: we re-execute main, so frame is restored
  - heap: restored from saved check-point but may be relocated
- Think: this works even if we restart on different machine!
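The single-loop pattern can be imitated without DOME. The sketch below is a stand-in written for illustration, not DOME code: `State`, `save`, `restore` and `run` are hypothetical names, a string plays the role of the check-point file, and the `crash_after` parameter simulates a failure. The state is written as text, which is one way to keep the check-point independent of the machine representation.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// The variables that are live at the bottom of the loop.
struct State {
    int step = 0;
    long sum = 0;
};

// Save the state as text (implementation-independent format).
void save(const State& s, std::string& store) {
    std::ostringstream out;
    out << s.step << ' ' << s.sum;
    store = out.str();
}

bool restore(State& s, const std::string& store) {
    std::istringstream in(store);
    return static_cast<bool>(in >> s.step >> s.sum);
}

// Runs the loop to completion, or stops after `crash_after` iterations
// to simulate a failure (a negative value means never).
State run(bool restarting, std::string& store, int total, int crash_after) {
    State s;  // default values play the role of user initialization
    if (restarting) {
        bool ok = restore(s, store);  // recovery script: reload saved variables
        assert(ok);
        (void)ok;
    }
    int executed = 0;
    while (s.step < total) {          // loop test uses only saved variables
        if (executed == crash_after) return s;
        s.sum += s.step;              // stand-in for do_computation()
        ++s.step;
        save(s, store);               // check-point at bottom of loop
        ++executed;
    }
    return s;
}
```

Running to completion, or crashing after four iterations and then calling `run` again with `restarting` set, produces the same final state, because the restart simply drops into the loop with the restored variables.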
29 Remarks
- Loop body is allowed to make function calls
  - real restriction is that there is one check-point and it must be in main
- Command-line parameter is used to determine whether execution is normal or restart
- User must write some code to restore variables from check-point
  - perhaps library code can help
30 More complex example
    f()
    {
        dScalar<int> i;
        do_f_stuff;
        g(i);
        next_statement;
    }

    g(dScalar<int> I)
    {
        do_g_stuff_1;
        dome_checkpoint();
        do_g_stuff_2;
    }
31 General scenario
- Check-point could happen deep inside a bunch of procedure calls.
- On restart, we need to restore stack so procedure returns etc. can happen normally.
- Solution: save information about which procedure invocations are live at check-point
32 Example with Dome constructs
    f()
    {
        dScalar<int> i;
        if (is_dome_restarting()) {
            next_call = dome_get_next_call();
            ..
        }
        do_f_stuff;
        dome_push("g1");
    g1:
        g(i);
        dome_pop();
        next_statement;
    }

    g(dScalar<int> I)
    {
        if (is_dome_restarting())
            goto restart_done;
        do_g_stuff_1;
        dome_checkpoint();
    restart_done:
        do_g_stuff_2;
    }
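The control flow of this example can be mimicked in plain C++ to see what a restart re-executes. The sketch below is a mock written for illustration and does not use the DOME API: `Runtime`, `restart_from` and the trace strings are hypothetical, a vector of labels stands in for the saved call information, and structured `if` statements replace the `goto`s.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Runtime {
    bool restarting = false;
    std::vector<std::string> live_calls;   // labels of call sites currently active
    std::vector<std::string> saved_calls;  // copy of live_calls written at the check-point
    std::size_t replay_pos = 0;            // next saved label to hand out on restart
    std::vector<std::string> trace;        // records which user code ran
};

void g(Runtime& rt) {
    if (!rt.restarting) {
        rt.trace.push_back("do_g_stuff_1");
        rt.saved_calls = rt.live_calls;    // stand-in for the check-point
    }
    rt.restarting = false;                 // restart_done: normal execution resumes
    rt.trace.push_back("do_g_stuff_2");
}

void f(Runtime& rt) {
    bool resume_at_g1 = false;
    if (rt.restarting) {
        // Ask which call of f was live when the check-point was taken.
        resume_at_g1 = (rt.saved_calls.at(rt.replay_pos++) == "g1");
    }
    if (!resume_at_g1) {
        rt.trace.push_back("do_f_stuff");
        rt.live_calls.push_back("g1");     // record the call about to be made
    }
    g(rt);                                 // call site labelled g1
    rt.live_calls.pop_back();
    rt.trace.push_back("next_statement");
}

// Builds the runtime state a restarted program would begin with.
Runtime restart_from(const Runtime& crashed) {
    Runtime rt;
    rt.restarting = true;
    rt.saved_calls = crashed.saved_calls;
    rt.live_calls = crashed.saved_calls;   // the restored call stack
    return rt;
}
```

In a normal run all four pieces of user code execute. After `restart_from`, calling `f` skips `do_f_stuff` and `do_g_stuff_1`, re-enters `g` through the recorded call site, and continues with `do_g_stuff_2` and `next_statement`, so the procedure return happens normally.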
33 Challenge
- Do this for MPI code.
- Can compiler determine
  - where to check-point?
  - what data to check-point?
- Need not save all data live at check-point
  - if some variables can be easily recomputed from saved data and program constants, we can re-compute those values in the recovery script.
  - we can modify program to make this easier.
- Measure of success: beat hand-written recovery code