SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery

Slides: 38

Provided by: daniel83

Learn more at: https://pages.cs.wisc.edu

Transcript and Presenter's Notes

1
SafetyNet: Improving the Availability of Shared
Memory Multiprocessors with Global
Checkpoint/Recovery

Daniel J. Sorin, Milo M. K. Martin,
Mark D. Hill, and David A. Wood
Computer Sciences Department
University of Wisconsin-Madison

2
Overview

Hardware fault frequencies are increasing
Hardware checkpoint/recovery for multiprocessors
Transparent to software
SafetyNet Innovations
Efficient coordination of checkpoint creation
Optimized logging of checkpoint state
Checkpoint validation off critical path
SafetyNet achieves all 3 goals; existing systems get only 2
High availability
High performance
Low cost

3
Outline

Availability
Motivation
Example targeted faults
Differences between SafetyNet and existing approaches
SafetyNet Key Features
A SafetyNet Implementation
Evaluation
Conclusions

4
Availability: Motivation

Fault frequencies are increasing
Technological reasons
Smaller transistors
Denser wires
Architectural reasons
More components
More aggressive designs
Marketing trends demand more availability
Need architectural solution to improve availability

5
Which Faults Do We Target?

Hardware faults in shared memory multiprocessors
Mostly transient, some permanent, not chipkill
We focus on faults outside of processor cores
Why? Good techniques already exist for processors (e.g., DIVA)
Interconnection network
Example: dead switch
Detect with timeout
Cache coherence protocols
Example: lost coherence message
Detect with timeout

6
System Hardware Design Space
Existing systems get only 2 out of 3 features
[Figure: design space with three goals: High Availability, High Performance, Low Cost. Systems placed in it: Backward Error Recovery (Tandem NonStop), Forward Error Recovery (IBM mainframes), Servers and PCs.]

7
Outline

Availability
SafetyNet Key Features
System abstraction
Innovations
A SafetyNet Implementation
Evaluation
Conclusions

8
SafetyNet Abstraction
[Figure labels: Most Recently Validated Checkpoint (Recovery Point); Checkpoints Awaiting Validation; Active (Architectural) State of System. Each level shows processor state together with a memory checkpoint or the current memory version.]
9
SafetyNet Execution Model
[Figure: checkpoints CP1 through CP5 laid out along a time axis, with the labels "recovery pt", "validating", and "active".]
10
SafetyNet Goal and Innovations

Goal: recover to a consistent checkpoint if a fault occurs
Inefficient but correct solution
Periodically quiesce entire system to take checkpoint
Checkpoints include all system state
Stop system to validate checkpoints as fault free
SafetyNet innovations
Efficient coordination of checkpoint creation across system
Optimized checkpointing of system state
Pipelined validation of checkpoints in background

11
Key 1: Coordinating Checkpoint Creation

Checkpoints must reflect consistent system state
Nodes must agree on memory values and coherence
Coordinate checkpoints in logical time
Logical time is time base that respects causality
Each node maintains its own logical clock
Create checkpoint every K logical cycles
We need logical time base that helps coordination

12
Logical Time Base

Many logical time bases exist
Depends on coherence protocol
Broadcast snooping systems
Increment clock for every coherence request processed
Nodes can be at different logical times
All nodes can agree when coherence transaction happens
Directory protocol systems
Based on loosely synchronized physical clock (10 kHz)
More complicated explanation; refer to paper for details
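
A minimal sketch of the snooping time base described above, assuming a simplified model in which every node sees the same totally ordered stream of broadcast coherence requests. The class `Node`, the parameter `k`, and the helper `interval_of` are illustrative names, not part of SafetyNet. Nodes may lag one another in logical time, yet each assigns a given request to the same checkpoint interval.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical node model: one logical clock per node."""
    k: int                                   # checkpoint interval in logical cycles (assumed)
    logical_time: int = 0
    boundaries: List[int] = field(default_factory=list)

    def process_coherence_request(self) -> None:
        # Snooping time base: tick once per coherence request processed.
        self.logical_time += 1
        if self.logical_time % self.k == 0:
            # Every K logical cycles the node starts a new checkpoint interval.
            self.boundaries.append(self.logical_time)

    def interval_of(self, request_index: int) -> int:
        # Interval containing the request_index-th broadcast request (1-based).
        return (request_index - 1) // self.k


# Two nodes at different logical times: one has processed 10 requests, the other 6.
fast, slow = Node(k=4), Node(k=4)
for _ in range(10):
    fast.process_coherence_request()
for _ in range(6):
    slow.process_coherence_request()
```

Here `fast` has already crossed two checkpoint boundaries and `slow` only one, but both place request 6 in the same interval, which is the agreement property the slide relies on.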

13
Key 2: Optimized Checkpointing of System State

Checkpoint all state needed to resume execution
Processor registers
Memory state (including cache state)
Cache coherence state
Processors save register state at each checkpoint
Copy registers into shadow registers
Logically, cache/memory log old data every time:
A store overwrites an old checkpoint of a block
A block's coherence ownership is transferred
How can we reduce the amount of logged state?

14
Optimized Logging

Insight: only recover at checkpoint granularity
Intervals between checkpoints group writes/transfers
E.g., checkpoint every 100,000 cycles (100 µsec at 1 GHz)
Only log first store/transfer per block per interval
Optimization at cache
Label cache blocks with checkpoint numbers (CNs)
If write/transfer is from same checkpoint, no logging needed
Large benefit due to locality of references
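
The CN check can be sketched as follows. This is a software model under stated assumptions, not the hardware design: `LoggingCache`, `Block`, and the use of `None` for a block with no CN label are all illustrative. Only the first store to a block in each checkpoint interval appends the old data to the CLB.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Block:
    data: int
    cn: Optional[int] = None   # None models a block with no CN label yet (assumption)


class LoggingCache:
    def __init__(self, current_cn: int) -> None:
        self.current_cn = current_cn
        self.blocks: Dict[str, Block] = {}
        self.clb: List[Tuple[str, int]] = []   # FIFO log of (addr, old data)

    def store(self, addr: str, value: int) -> None:
        block = self.blocks[addr]
        if block.cn != self.current_cn:
            # First store to this block in the current interval: log old data once.
            self.clb.append((addr, block.data))
            block.cn = self.current_cn
        # Later stores in the same interval skip the log entirely.
        block.data = value


cache = LoggingCache(current_cn=4)
cache.blocks["B"] = Block(data=2000)
cache.store("B", 3000)   # logs 2000
cache.store("B", 3500)   # same interval, nothing logged
cache.current_cn = 5     # a new checkpoint interval begins
cache.store("B", 4000)   # logs 3500
```

Three stores produce two log entries here; with real locality, many stores to the same block inside one interval collapse into a single entry.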

15
Key 3: Checkpoint Validation in Background

Only validate when all nodes agree the checkpoint is fault-free
Example: no outstanding coherence requests in checkpoint
Nodes perform fault detection, then coordinate
Can be in background and pipelined
Reason why we have checkpoints awaiting validation
Can hide long fault detection latencies
Tolerance = number of outstanding checkpoints x checkpoint length
Design tolerance to be longer than longest detection latency
Don't slow down execution to validate checkpoints
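
The tolerance rule is simple arithmetic, sketched here with hypothetical helper names. The checkpoint length of 100,000 cycles at 1 GHz comes from the earlier slide; the count of 4 outstanding checkpoints and the 250 µsec worst-case detection latency are assumed values chosen only for illustration.

```python
def interval_ns(interval_cycles: int, clock_hz: int) -> int:
    """Length of one checkpoint interval in nanoseconds."""
    return interval_cycles * 1_000_000_000 // clock_hz


def detection_tolerance_ns(outstanding: int, interval_cycles: int, clock_hz: int) -> int:
    """Longest fault detection latency that can be hidden."""
    return outstanding * interval_ns(interval_cycles, clock_hz)


def checkpoints_needed(worst_latency_ns: int, interval_cycles: int, clock_hz: int) -> int:
    """Smallest number of outstanding checkpoints covering the worst-case latency."""
    length = interval_ns(interval_cycles, clock_hz)
    return -(-worst_latency_ns // length)   # ceiling division on integers


# Assumed design point: 4 outstanding checkpoints of 100,000 cycles at 1 GHz.
tolerance = detection_tolerance_ns(4, 100_000, 1_000_000_000)
# Assumed worst-case detection latency of 250 microseconds.
needed = checkpoints_needed(250_000, 100_000, 1_000_000_000)
```

Under these assumptions the system can hide detection latencies of up to 400 µsec, and a 250 µsec timeout would need at least 3 outstanding checkpoints.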

16
Outline

Availability
SafetyNet Key Features
A SafetyNet Implementation
Evaluation
Conclusions

17
System Model
[Figure: node containing a CPU with register checkpoints (reg CPs), cache(s) with a CLB, memory with a CLB, a network interface, an I/O bridge, and NS and EW half switches.]

Checkpoint Log Buffer (CLB) at cache and memory
Just FIFO log of block writes/transfers

18
Example of SafetyNet Operation
[Figure: two nodes joined by the interconnection network, each with saved register checkpoints (Regs CP2, Regs CP3), a cache (Addr, State, CN, data), and a CLB (Addr, State, data).]
Cache entry shown: B, M, data 2000
Recovery point is checkpoint 2. Most recent checkpoint is 3. Active checkpoint is 4. Processor 1 owns block B (validated).
19
Example of SafetyNet Operation
[Figure: two nodes joined by the interconnection network, each with Regs CP2 and Regs CP3, a cache (Addr, State, CN, data), and a CLB (Addr, State, data).]
Cache entry shown: B, M, CN 4, data 3000
CLB entry shown: B, M, data 2000
P1 stores 3000 to block B between checkpoints 3 and 4. Logs old data.
20
Example of SafetyNet Operation
[Figure: two nodes joined by the interconnection network, each with Regs CP2 and Regs CP3, a cache (Addr, State, CN, data), and a CLB (Addr, State, data).]
Cache entry shown: B, M, CN 4, data 3000
CLB entry shown: B, M, data 2000
P1 loads from block B. SafetyNet uninvolved.
21
Example of SafetyNet Operation
[Figure: two nodes joined by the interconnection network, each with Regs CP2, Regs CP3, and Regs CP4, a cache (Addr, State, CN, data), and a CLB (Addr, State, data).]
Cache entry shown: B, M, CN 4, data 3000
CLB entry shown: B, M, data 2000
Coordinated creation of checkpoint 4. Active checkpoint is 5. Save register state at beginning of checkpoint 4.
22
Example of SafetyNet Operation
[Figure: two nodes joined by the interconnection network, each with Regs CP2, Regs CP3, and Regs CP4, a cache (Addr, State, CN, data), and a CLB (Addr, State, data).]
Cache entry shown: B, M, CN 5, data 3000
CLB entries shown: B, M, data 2000; B, M, data 3000
P2 requests ownership of block B. P1 logs old data and sends copy to P2. P1 invalidates cache entry.
23
Example of SafetyNet Operation
[Figure: two nodes joined by the interconnection network, each with Regs CP3 and Regs CP4, a cache (Addr, State, CN, data), and a CLB (Addr, State, data).]
Cache entry shown: B, M, CN 5, data 3000
CLB entries shown: B, M, data 2000; B, M, data 3000
Validation of checkpoint 3. Discard checkpoint 2 registers. Recovery point is now beginning of checkpoint 3.
24
Example of SafetyNet Operation
[Figure: two nodes joined by the interconnection network, each with Regs CP3, a cache (Addr, State, CN, data), and an empty CLB (Addr, State, data).]
Cache entry shown: B, M, data 2000
Recovery (to checkpoint 3). Restore CP3 registers. Restore ownership of B to P1. Invalidate B at P2. Now restart system!
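
One way to picture the recovery step is to undo log entries newest first until only validated state remains. The sketch below is an assumption-laden simplification: it uses a single global log instead of per-node CLBs, tags each entry with a checkpoint number, and models a block as an (owner, data) pair. The function `recover` and the type aliases are illustrative, not SafetyNet's mechanism as specified in the slides.

```python
from typing import Dict, List, Tuple

State = Tuple[str, int]              # (owner, data) for one block (assumed model)
LogEntry = Tuple[int, str, State]    # (checkpoint number, addr, old state)


def recover(blocks: Dict[str, State], log: List[LogEntry], recovery_cn: int) -> None:
    """Restore every block changed after the recovery point."""
    # Undo newest entries first so the oldest logged state is written last.
    while log and log[-1][0] > recovery_cn:
        _, addr, old_state = log.pop()
        blocks[addr] = old_state


# State after the ownership transfer in the example: P2 owns B with data 3000.
blocks = {"B": ("P2", 3000)}
log = [(4, "B", ("P1", 2000)), (5, "B", ("P1", 3000))]
recover(blocks, log, recovery_cn=3)
```

After the call, block B is back with P1 holding 2000 and the log is empty, matching the final state in the example.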
25
System Recovery and Restart

Any component can trigger recovery
E.g., processor times out on coherence request
All in-progress transactions are dropped
By definition, these transactions are not validated
After recovery, resume execution
May have to reconfigure (e.g., route around dead link)
Must replay work that was lost

26
I/O and the Outside World

Output commit problem: can't send uncommitted data beyond sphere of recoverability
SafetyNet includes processors, memory, coherence
Doesn't include network, disks, printer, etc.
Standard solution: wait to communicate with I/O
Only send validated data to outside world
Input commit problem: input can't be recovered
Standard solution: log input
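
The output commit rule can be sketched as a buffer that holds each output until the checkpoint interval that produced it is validated. The class `OutputBuffer` and its method names are hypothetical, and tagging outputs with a checkpoint number is an assumption made for this illustration.

```python
from collections import deque
from typing import Deque, List, Tuple


class OutputBuffer:
    """Holds outputs inside the sphere of recoverability until validated."""

    def __init__(self) -> None:
        self.pending: Deque[Tuple[int, str]] = deque()   # (checkpoint number, payload)
        self.sent: List[str] = []                        # what the outside world has seen

    def write(self, cn: int, payload: str) -> None:
        # Output produced during interval cn is not yet safe to release.
        self.pending.append((cn, payload))

    def on_validated(self, validated_cn: int) -> None:
        # Release, in order, everything from intervals now known fault-free.
        while self.pending and self.pending[0][0] <= validated_cn:
            self.sent.append(self.pending.popleft()[1])

    def on_recovery(self) -> None:
        # Anything still pending was unvalidated; it will be regenerated on replay.
        self.pending.clear()


out = OutputBuffer()
out.write(3, "a")
out.write(4, "b")
out.on_validated(3)   # releases "a"
out.on_recovery()     # discards "b"
```

The cost of this scheme is added output latency, bounded by how long a checkpoint can wait for validation.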

27
Outline

Availability
SafetyNet Key Features
A SafetyNet Implementation
Evaluation
Methodology
Runtime performance
Conclusions

28
Methodology: Simulation and Workloads

Simulation
Simics full-system simulation of 16-proc SPARC system
Detailed timing simulation of memory system
MOSI directory cache coherence protocol
Simple, in-order processor model
128KB L1I/D, 4MB L2, 512KB CLB
Workloads (commercial and scientific)
Online transaction processing (OLTP): IBM's DB2
Static web server: Apache driven by SURGE
Dynamic web server: Slashcode
Java server: SPECjbb
Scientific: barnes-hut from SPLASH-2

29
Runtime Performance
Normalize results to unprotected system
30
Runtime Performance
Unprotected system crashes if fault occurs
31
Runtime Performance
Error bars: +/- one standard deviation
SafetyNet has same fault-free performance as unprotected
32
Runtime Performance
SafetyNet avoids crashes in presence of lost messages
33
Runtime Performance
SafetyNet avoids crashes in presence of dead half-switch
34
High-Level Comparison to ReVive
Feature | ReVive | SafetyNet
Backward error recovery scheme | Yes | Yes
Fault model | Transient + permanent | Transient + some permanent
Processor modification | No | Yes
Software modification | Minor | None
Fault-free performance | 6-10% loss | No loss
Output commit latency | At least 100 milliseconds | No more than 0.4 milliseconds
35
Conclusions