Title: Designing High Availability Systems with Commodity Hardware and Software
1Designing High Availability Systems with
Commodity Hardware and Software
2Research Overview
- Design high availability systems for commodity
market
Resource efficiency to avoid full duplication
3Problem Statement
- High availability currently a feature in high-end servers
- Employ specialized and proprietary hardware/software
- Trend towards more, but less reliable, transistors
- Design for availability important for a diverse set of systems
- Need high availability (HA) for mid-range systems
Borkar et al., 2004
4Constraints (1 of 2)
- General-purpose hardware
- Off-the-shelf software
- Availability on demand
- Non-HA applications should not be penalized
Need to be commodity based and on-demand
5Constraints (2 of 2)
- Need to handle soft errors
- Fault Isolation to prevent correlated errors
- Need to handle hard errors
- Reconfiguration to salvage fault free components
Need to provide full coverage
6Adapting Commodity Systems: Challenges
- Hardware
- Commodity hardware designed with shared resources
- Engineering issues in using available redundancy in CMPs
- Full duplication not cost-effective
- Software
- Commodity OS change timeframe too long
- Changes need to be ported to various OSes
7Prior approaches not applicable
- NonStop, Stratus: redundant chips, I/O comparison
- Expensive, especially with future CMPs
- System software managed redundancy problematic
- Intrusive changes required to OS (NonStop kernel, VOS)
- Restricted to applications targeted to particular niche OS
- zSeries: redundant pipelines, instruction comparison
- Tight lock-stepping not scalable
- Latency of comparison, local error correction etc.
- Imposes constant overhead for all applications
- Requires custom hardware design
8Key Contributions
Today's talk
- Configurable Isolation: transient fault detection, reconfiguration, commodity CMP
- Transparent Redundancy: VMM based redundancy, off-the-shelf OS, low overhead
ISCA, SELSE 2007; IEEE Computer 2007; 2 patents filed 2007
ASPLOS 2008 (poster); SELSE 2008; HP Tech Report 2008
- Resource Efficiency: duplication cache, selective availability, adaptive availability
- Power Efficiency
- - Dynamic reprovisioning
- - Power efficient checkpoints
- - DRAM speculation
SELSE, ISCA 2007; HPCA 2008; 2 patents filed 2007, 2008
SELSE 2008; 2 patents filed 2008
9Research Components
Configurable Isolation
10Research Focus
- Loose-lockstepped systems (NonStop, Stratus)
- Tight lock-stepping (zSeries) is becoming harder
- Easily extensible to commodity hardware
- I/O level comparison
- No intrusive changes inside the chip
- Comparison bandwidth required is low
11Conventional Multi-Core System Shared
- Commodity hardware
- Emphasis on shared resources
- No fault isolation
- Correlated errors can go undetected even with TMR cores
- No reconfiguration support
- One failure can be catastrophic
12Design Alternatives Full Isolation
- Non-commodity design
- Static partitioning of resources
- Fault isolation per slice
- No correlated errors
- Limited reconfiguration
- Only at slice level
13Configurable Isolation
- Techniques that provide optional logical fault isolation for shared components
- Addresses tension between fully shared and fully isolated
- High performance mode
- CMP resources shared for maximum utilization
- High availability mode
- Degree of sharing configurable for selective isolation
14High Availability Mode
- Split resources into domains
- Map domains to different colors
- Colors are the units of fault containment
- Configurable number of colors
- Each cache bank can store any line (subject to
configuration)
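The color mapping above can be sketched in a few lines. This is only an illustrative model, not the hardware mechanism from the talk: the class `CmpConfig`, its fields, and the four-core, four-bank layout are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class CmpConfig:
    """Hypothetical model: every core and cache bank carries a color (fault domain)."""
    core_color: dict = field(default_factory=dict)   # core id -> color
    bank_color: dict = field(default_factory=dict)   # bank id -> color
    high_availability: bool = False

    def banks_for(self, core):
        """Banks a core may cache lines in under the current mode."""
        if not self.high_availability:
            return sorted(self.bank_color)            # high performance mode: fully shared
        color = self.core_color[core]
        return sorted(b for b, c in self.bank_color.items() if c == color)

    def map_out_bank(self, bank):
        """Reconfiguration: remove a faulty bank from whatever domain held it."""
        self.bank_color.pop(bank, None)

# Assumed layout: two colors, two cores and two banks per color.
cfg = CmpConfig(
    core_color={0: "red", 1: "red", 2: "green", 3: "green"},
    bank_color={0: "red", 1: "red", 2: "green", 3: "green"},
)
```

Switching `cfg.high_availability` to `True` restricts each core to banks of its own color, and `map_out_bank` shrinks a domain without touching the other one.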
15Transient Fault Isolation
- Loose lock-step design
- Resources split in two domains
- Loose lock-step software stack
- Voters in I/O hub or hypervisor
- Enables isolation for transient fault detection
16Reconfiguration
- Map out faulty components
20Enhanced Commodity CMPs
- Enable fault isolation
- Ring and bank addressing
- Cross links and input multiplexers
- Ring configuration unit (RCU)
- Self checked logic
- Enable reconfiguration
- Mode and tag bits to enable caching lines from any bank
- Extra tag bits required is log2(number of banks)
- e.g., 3 extra bits in 512 bit line (for 8 banks)
- Expected overhead minor (area, timing and power)
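The tag-bit arithmetic can be checked directly. A minimal sketch, assuming the bank count is rounded up to the next power of two; the function names are made up for the example.

```python
def extra_tag_bits(num_banks: int) -> int:
    """Bits needed to name any one of num_banks banks, i.e. ceil(log2(num_banks))."""
    return (num_banks - 1).bit_length()

def relative_overhead(num_banks: int, line_bits: int = 512) -> float:
    """Extra tag bits as a fraction of the data bits in one cache line."""
    return extra_tag_bits(num_banks) / line_bits

print(extra_tag_bits(8))                 # 3
print(f"{relative_overhead(8):.2%}")     # 0.59%
```

For the 8-bank, 512-bit-line case on the slide this gives 3 bits, well under 1% of the line's data bits.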
21Evaluation Methodology
- Impact of hard faults
- Simulate system compute capacity over 100,000 hours
- 3 systems: shared, full isolation, configurable isolation
- Three workloads (based on cache requirements)
- Large memory: gap, apsi, swim, mcf
- Mixed memory: applu, perlbmk, fma3d, crafty
- Small memory: vortex, equake, facerec, mesa
- State-of-the-art industrial fault model (from HP)
- FIT rates and distributions for various components
22Measure Impact of Faults
- Effect of faults on the three architectures
- Conventional: faults lead to loss of entire system
- Full Isolation: faults lead to loss of slice
- Configurable Isolation: faults lead to reconfiguration
- Performance degradation based on type of fault
- Architecture experiences a sequence of faults
- Re-configuration due to each fault
- New configuration's throughput assigned till next fault
23Two Phase Simulation Methodology
- 1st phase: exhaustive simulation of all configurations
- Determines each configuration's throughput
- Full system x86 simulation
- 2nd phase: Monte Carlo simulation
- Determines fault sequences
- Generate expected fault time per component
- Inject faults during each run of Monte Carlo simulation
- 10,000 runs for 100,000 simulated hours each
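The second phase can be sketched as follows. This is a simplified illustration, not the simulator used in the talk: the component list, the Weibull parameters, and the two throughput functions are placeholders standing in for the phase-one results and the HP fault model.

```python
import random

def simulate_run(components, throughput_of, horizon_hours, rng):
    """One Monte Carlo run: draw a fault time per component, then integrate throughput."""
    fault_times = sorted(
        (rng.weibullvariate(scale, shape), name)
        for name, (scale, shape) in components.items()
    )
    alive = frozenset(components)
    now, work = 0.0, 0.0
    for t, name in fault_times:
        if t >= horizon_hours:
            break
        work += throughput_of(alive) * (t - now)   # current configuration runs until this fault
        alive = alive - {name}                     # reconfigure: map out the faulty component
        now = t
    work += throughput_of(alive) * (horizon_hours - now)
    return work / horizon_hours                    # average compute capacity over the run

def expected_capacity(components, throughput_of, horizon_hours=100_000, runs=10_000, seed=0):
    """Average capacity over many independent fault sequences."""
    rng = random.Random(seed)
    total = sum(simulate_run(components, throughput_of, horizon_hours, rng) for _ in range(runs))
    return total / runs

# Placeholder inputs: four cores with made-up Weibull (scale, shape) parameters.
cores = {f"core{i}": (400_000.0, 0.8) for i in range(4)}
configurable = lambda alive: len(alive) / 4             # lose one core's share per fault
shared = lambda alive: 1.0 if len(alive) == 4 else 0.0  # any fault loses the whole system

cap_configurable = expected_capacity(cores, configurable, runs=1000)
cap_shared = expected_capacity(cores, shared, runs=1000)
```

Because both calls use the same seed they see the same fault sequences, so the difference between the two results comes only from how each architecture reacts to a fault.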
24Results (Mixed Memory Workload)
25Configurable Isolation Summary
- Configurable Isolation
- Enables transient fault detection
- Enables reconfiguration
- Requires non-intrusive changes
26Research Components
Transparent Redundancy
27System software based redundancy
- Specialized OS: NonStop kernel, Stratus VOS
- Create process pairs: use middleware targeted to OS
- Manage replicas as lock-stepped state machines
- Same inputs at same time -> same outputs unless error
- Sources of input non-determinism (single threaded)
- Non-deterministic instructions
- Force same output at replicas
- Asynchronous inputs
- Force delivery at same instruction in replicas
Specialized functionality with intrusive OS
changes
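The lock-stepped state machine idea can be illustrated with a toy example. This is a sketch only: the `Counter` machine and the injected bit flip are invented for the illustration and do not model NonStop or VOS.

```python
class Counter:
    """A deterministic state machine: same inputs in the same order give the same outputs."""
    def __init__(self):
        self.total = 0

    def step(self, value):
        self.total += value
        return self.total

def run_lockstep(inputs, fault_at=None):
    """Feed identical inputs to two replicas and compare their outputs after every step."""
    a, b = Counter(), Counter()
    for i, value in enumerate(inputs):
        out_a = a.step(value)
        out_b = b.step(value)
        if i == fault_at:
            out_b ^= 1 << 3            # injected single-bit flip, standing in for a soft error
        if out_a != out_b:
            return ("mismatch", i)     # outputs diverged: an error occurred in one replica
    return ("ok", a.total)
```

With no fault the replicas agree on every output; with `fault_at` set, the comparison flags the step at which they diverge.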
28Virtualization can help
- VMM operation fundamentally based on
- Ability to identify relevant events
- Transform events to hide non-native execution
VMM can create and synchronize replicas
transparently
29Transparent Redundancy Operation
- Initialization: VMM boot up
- VMM replicates high availability VMs
- Replicas execute deterministic instructions natively
- Transfer to VMM for events that need synchronization
- Periodic control transfer to VMM
- Instruction decrement counter
- Timer interrupt
- Hypervisor based fault tolerance (fail stop hardware)
- CMP based approach with focus on soft error detection
Bressoud et al., ACM Transactions on Computer Systems, 1996
30Transparent Redundancy Operation
IPI: inter-processor interrupt
[Timeline figure, two replicas: Start R on both; privileged non-deterministic instruction -> send result to R via IPI; on the other replica, privileged non-deterministic instruction -> apply buffered result from R.]
31Transparent Redundancy Operation
ND: non-deterministic
[Timeline figure, two replicas: Start R on both; privileged non-deterministic instruction -> send result to R via IPI; privileged ND -> apply buffered result from R; user ND -> trap, send result to R via IPI; user ND -> trap, apply buffered result from R.]
32Transparent Redundancy Operation
[Timeline figure, two replicas: Start R on both; privileged non-deterministic instruction -> send result to R via IPI; privileged ND -> apply buffered result from R; user ND -> trap, send result to R via IPI; user ND -> trap, apply buffered result from R; asynchronous interrupt (PC?) -> handshake between replicas, then both handle the interrupt.]
33Transparent Redundancy Operation
[Timeline figure, two replicas: Start R on both; privileged non-deterministic instruction -> send result to R via IPI; privileged ND -> apply buffered result from R; user ND -> trap, send result to R via IPI; user ND -> trap, apply buffered result from R; asynchronous interrupt (PC?) -> handshake between replicas, then both handle the interrupt; I/O -> one replica waits, both vote at the voter (V).]
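The sequence in the timeline can be approximated in software. The sketch below is a loose illustration under stated assumptions: the `Replica` class, the leader/follower roles, the `deque` standing in for the IPI-delivered buffer, and the state-hash "program" are all invented for the example and are not the VMM implementation.

```python
import random
from collections import deque

class Replica:
    def __init__(self, name):
        self.name = name
        self.state = 0

    def deterministic_step(self, x):
        """Deterministic work: executed natively, no synchronization needed."""
        self.state = (self.state * 31 + x) % 1_000_003

    def apply_nd(self, value):
        """Fold the agreed result of a non-deterministic instruction into the state."""
        self.deterministic_step(value)

def run_pair(program, rng):
    leader, follower = Replica("leader"), Replica("follower")
    channel = deque()                          # stands in for the IPI-delivered buffer
    outputs = []
    for op, arg in program:
        if op == "det":
            leader.deterministic_step(arg)
            follower.deterministic_step(arg)
        elif op == "nd":                       # e.g. reading a timestamp counter
            value = rng.randrange(1 << 16)     # leader executes the instruction for real
            channel.append(value)              # and sends the result to the other replica
            leader.apply_nd(value)
            follower.apply_nd(channel.popleft())   # follower applies the buffered result
        elif op == "io":
            if leader.state != follower.state:     # vote before output leaves the system
                raise RuntimeError("replicas disagree; start recovery")
            outputs.append(leader.state)
    return outputs

program = [("det", 1), ("nd", None), ("det", 2), ("io", None)]
result = run_pair(program, random.Random(1))
```

Forwarding the non-deterministic result keeps the two states identical, so the vote at the I/O step only fails when a fault has corrupted one replica.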
34Transparent Redundancy Operation
35Evaluation Methodology
- Analytical model with calibration experiments
36Model parameters
- VMware's Replay log to calculate event frequency
- Replay and Retrace software in VMware Workstation
- Logs all non-deterministic events and outcomes
- Uses log to deterministically replay a VM later
- Experiments using Xen and Linux
- Measure overhead of event handling
- Validate key functionality
VMware academic partnership
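One plausible shape for such a model is event frequency multiplied by per-event handling cost. The sketch below assumes that form; the slides do not spell out the model, and the event names, rates, and costs are placeholders rather than measurements from the talk.

```python
def sync_overhead(event_rates_per_s, cost_us):
    """Fraction of each second spent on replica synchronization.

    event_rates_per_s: events per second, e.g. counted from a replay log
    cost_us:           handling cost per event in microseconds, e.g. measured on a VMM
    """
    return sum(rate * cost_us[event] * 1e-6 for event, rate in event_rates_per_s.items())

# Placeholder inputs for illustration only.
rates = {"nd_instruction": 2000, "interrupt": 1000, "io": 500}
costs = {"nd_instruction": 5.0, "interrupt": 20.0, "io": 40.0}

overhead = sync_overhead(rates, costs)
print(f"{overhead:.1%}")   # 5.0%
```

Splitting the inputs this way mirrors the slide: the log supplies the frequencies and the Xen and Linux experiments supply the costs.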
37Synchronization Overhead
Synchronization overheads are small: 3-14%
38VMM based transparent availability
- VMM based transparent availability promising
- Synchronization overheads are small (3-14%)
- Can support recovery from hard errors in software
- Recovery from soft errors requires checkpoints
- Checkpoint overhead in software large: Xen, VMware
- Hardware checkpoints can help (6%): ReVive I/O
Nakano et al., HPCA 2006
39Summary
- High availability increasingly important
- Configurable Isolation
- Isolation to enable effectively 100% soft error detection
- Ability to reconfigure after hard faults
- Non-intrusive changes to general purpose hardware
- Transparent Redundancy
- Transparent availability for off-the-shelf software
- Selective redundancy for high availability applications
- Resource efficiency important to avoid full duplication
40Questions
41Bonus Slides
42Transparent Redundancy Summary
- Transparent Redundancy
- Can enable availability for off-the-shelf OS
- Small synchronization overhead
43HA system with commodity blocks
[Figure: a VM and a replica VM, each running applications on an OS, above a VMM that performs replica synchronization.]
Resources are still duplicated!
44Duplication Overheads
- Core and logic duplication
- Can be used selectively in VMM based systems
- Redundancy can be selected on a per VM basis
- Expensive, but Moore's law helps
- Memory Duplication
- Memory cost in servers is significant
- Loose lock-stepped systems fully duplicate memory
- Memory hierarchy not shared across boards
45Reducing memory duplication overheads
- Shared resources in CMPs and VMM memory management present an opportunity
- If we can still ensure fault detection and recovery
- Key insights
- Computational errors only propagate to written pages
- Read only pages are susceptible to memory errors
- ECC can help detect and correct memory errors
- Written pages are a small fraction of memory footprint
Duplicate pages that are written, share read only pages
46Partial duplication
Legend: red = replica, green = replica, blue = shared
- Loose lockstepped CMPs can have partial duplication
Minor RCU modification to enable sharing of read only requests
47Duplication Cache
- Number of written pages varies by application
- Worst case design would need 100% duplication
- Software-managed page cache: Write Working Set
- Initially mark pages Read Only
- Dynamically duplicate on Write
- Manage cache replacement as Least Recently Written
- Pages replaced from cache are compared with original
- If match: discard replica, mark original read only
- If error: start recovery
- Small performance impact
- page comparisons
- soft page faults
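The bookkeeping described above might look like the following. It is a minimal sketch under assumptions: two replicas, pages modeled as byte strings, and a class name (`DuplicationCache`) and counters invented for the example; the real mechanism lives in the VMM's memory management.

```python
from collections import OrderedDict

class DuplicationCache:
    """Hypothetical write-working-set cache with least-recently-written replacement."""
    def __init__(self, shared_pages, capacity):
        self.shared = dict(shared_pages)   # page -> contents, read-only while shared
        self.capacity = capacity
        self.dup = OrderedDict()           # page -> [copy for replica 0, copy for replica 1]
        self.soft_faults = 0
        self.comparisons = 0

    def read(self, replica, page):
        return self.dup[page][replica] if page in self.dup else self.shared[page]

    def write(self, replica, page, data):
        if page not in self.dup:           # write to a read-only page: soft page fault
            self.soft_faults += 1
            if len(self.dup) >= self.capacity:
                self._evict()
            self.dup[page] = [self.shared[page], self.shared[page]]
        self.dup[page][replica] = data
        self.dup.move_to_end(page)         # now the most recently written page

    def _evict(self):
        page, copies = self.dup.popitem(last=False)   # least recently written page
        self.comparisons += 1
        if copies[0] != copies[1]:
            raise RuntimeError(f"page {page} differs between replicas; start recovery")
        self.shared[page] = copies[0]      # keep one copy and share it read-only again

cache = DuplicationCache({0: b"a", 1: b"b", 2: b"c"}, capacity=2)
for page, data in [(0, b"x"), (1, b"y"), (2, b"z")]:
    cache.write(0, page, data)
    cache.write(1, page, data)
```

The two cost sources named on the slide show up as the `soft_faults` and `comparisons` counters. The sketch assumes both replicas have finished writing a page before it is evicted; a real design would have to account for skew between loosely lock-stepped replicas.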
48IPC degradation vs. Memory Duplication
Workloads constructed as combination of SPEC
benchmarks
49Summary
- High availability increasingly important
- Configurable Isolation
- Isolation to enable effectively 100% soft error detection
- Ability to reconfigure after hard faults
- Non-intrusive changes to general purpose hardware
- Transparent Redundancy
- Transparent availability for off-the-shelf software
- Selective redundancy for high availability applications
- Resource efficiency to avoid full duplication
50Future Directions
- Hardware
- Reconfiguration policies: topology, workload requirements
- Intelligent use of heterogeneity
- Software
- Detailed evaluation of transparent redundancy
- Multithreaded redundancy
51Research Interests
- Efficient Data Center Design
- Conflicting constraints
- Power
- Performance
- Reliability, Availability
- Quality of Service, Service Level Agreements
- Manageability
- Conflicting solutions: hardware and multiple software levels
- Which level is best suited?
- Can virtual machines help?
- Enable adaptive, transparent and integrated resource management
52Hypervisor based fault tolerance
- Relies on fail stop hardware
- Based on conventional SMP not CMP
- Overheads extremely high due to Ethernet controller
- Only considered hard errors
- No voting or synchronization to detect soft errors
- Static epoch based protocol
- Epoch length determines interrupt handling latency
- No selective availability
- Full duplication of memory
Bressoud, ACM Transactions on Computer Systems, 1996
53Support required for transparent redundancy
- Instruction decrement counter (readable/writeable)
- Trap on user non-deterministic instructions
- Configurable isolation
54Replay
- Replay and Retrace software in Workstation
- Logs all non-deterministic events and outcomes
- Uses log to deterministically replay a VM later
55Other research
- Memory System Design
- Fair Queuing Memory Systems (MICRO 2006)
- Power Efficient DRAM speculation (HPCA 2008)
- Intrinsically Compatible Process VMs (UW-TR 2006)
56COVERT Operation
[Figure: applications and operating system above VMM 1 and a replica VM; labels: asynchronous interrupt, interrupt synchronization module, send instruction counter, ack / instruction counter, deliver interrupt.]
57COVERT Operation
[Figure: applications and operating system above the VMM and a replica VM; labels: nondeterministic OS call, OS call emulation module, translated OS call, OS response, translated OS response.]
58COVERT Operation
[Figure: applications and operating system above the VMM and a replica VM; labels: checkpoint module, checkpoint system state, checkpoint granularity.]
59COVERT Operation
[Figure: applications and operating system above VMM 1 and a replica VM; labels: miscomparison, permanent fault, recovery module, reconfigure and restart from checkpoint.]
60COVERT Operation
[Figure: applications and operating system above VMM 1 and a replica VM; labels: miscomparison, recovery module, restart from checkpoint.]
61Fault data
- Weibull distribution
- Decreasing hazard rate
- Inverse function
- $X = \alpha\,(-\ln U)^{1/\beta}$
- Different data for hard and soft errors
- Ratio of FIT rates between components important
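Inverse-transform sampling from a Weibull distribution can be sketched as below. The parameter names `scale` and `shape` and the example values are assumptions; the actual FIT-derived parameters are not given on the slide.

```python
import math
import random

def weibull_fault_time(scale, shape, u):
    """Inverse-transform sample: X = scale * (-ln U)^(1/shape), with U uniform on (0, 1]."""
    return scale * (-math.log(u)) ** (1.0 / shape)

def hazard(t, scale, shape):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale)^(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

rng = random.Random(0)
# 1 - random() lies in (0, 1], which keeps log() away from zero.
samples = [weibull_fault_time(100_000.0, 0.8, 1.0 - rng.random()) for _ in range(5)]
```

A shape parameter below 1 gives the decreasing hazard rate mentioned on the slide: `hazard(t, scale, shape)` falls as `t` grows.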
62Fault types
- Faults handled
- Transient fault in core logic
- Transient fault in system component logic
- Permanent faults in components
- Caused by wear out or manufacturing defect
- Latent defects
- Handle stuck at, bridging, open, delay
- Faults not handled
- Power delivery faults
- Open circuit defects or shorts that can affect entire chip
- Rarer because they require much higher temperature conditions
63Why SPEC?
- Single threaded
- Diverse memory characteristics
- Not speeding up the workload
- Isolate performance difference from architecture
- False positives in multithreaded execution
- NonStop doesn't use multithreaded applications
64Simulation parameters
65Component Replacements (Mixed Memory)
[Bar chart: normalized component replacements (y-axis, 0.0 to 1.0) for Shared, Full Isolation, and Configurable Isolation at 10%, 25%, and 50% degradation.]
66Component Replacements
67Dynamic Power Reprovisioning
- Configurable Isolation enables new optimizations
- Dynamic power reprovisioning
- Suitable for systems with chip wide thermal budget
- Reassign power allotment of deconfigured components
- Constraints
- Voltage supply
- Thermal budgets
- 10% extra performance over configurable isolation
68Systems with only hard error coverage
69FIT Rate 10X
70Component results - Shared
71Component results - Configurable
72Power Reprovisioning
73Summary of current CMP availability features
- Cores
- Soft error detection limited to register files with ECC
- No fault isolation in Opteron, Xeon and Niagara
- Limited isolation in Montecito and Power5 in electrical and logical partitions
- All architectures susceptible to soft errors in logic
- Montecito in lock-step configuration is an exception
74Summary of current CMP availability features
- Caches
- All architectures share at least one level of cache
- Resilient to errors in array
- ECC or parity checks at all cache levels
- No tolerance to multi-bit errors in Opteron and Xeon
- Niagara, Power5, Montecito handle some classes of multi-bit errors
- No tolerance to errors in cache circuitry or interconnect
- Entire socket susceptible to transient fault in cache controller state machine
75HP NonStop
76zSeries
77Cache
78Core
79Summary of current CMP availability features
- Memory
- Most fault tolerant resource
- All conditions for high availability typically satisfied in arrays
- Sophisticated techniques present in all architectures
- Chip kill, background scrubbing, DIMM sparing
- Memory access control circuitry unprotected
- Some architectures better than others
- Shared Northbridge more vulnerable than on-chip controllers
80Availability and Reliability
- Availability = MTBF / (MTBF + MTTR)
- Probability that system is operating when required
- Reliability: MTTF
- Probability that components will work for a period of time within some confidence interval
- Reliability: characteristic of device
- Availability: characteristic of device, system usage scenarios, and repair time
- Serviceability: early diagnosis of errors to avoid downtime (IBM call for repair)
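The availability formula is easy to work through numerically. A small sketch with made-up MTBF and MTTR values; the function names are illustrative.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_minutes_per_year(avail):
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - avail) * 365 * 24 * 60

a = availability(mtbf_hours=9_999, mttr_hours=1)   # hypothetical system
print(f"{a:.4f}")                                  # 0.9999
print(f"{downtime_minutes_per_year(a):.1f}")       # 52.6
```

The example shows why repair time matters as much as failure rate: halving MTTR at a fixed MTBF roughly halves the yearly downtime.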
81Component Replacements
Component replacements reduced by 60-100%
82Availability Sensitive Systems
83Synchronization overhead (breakdown)
85Memory Duplication Overheads
- Tightly lock-stepped systems don't duplicate memory
- Loose lock-stepped systems fully duplicate memory
86Partial duplication
87Ring Configuration Unit (Duplication)
88IBM Power 6 reconfiguration
- Deallocate cores
- Deallocate off chip L3 cache using on chip directory
- L2 cache error requires deallocation of all cores associated with a cache bank
- No deallocation of memory controllers etc.