Designing High Availability Systems with Commodity Hardware and Software

Slides: 89

Provided by: homepage6


1
Designing High Availability Systems with
Commodity Hardware and Software

Nidhi Aggarwal

2
Research Overview

Design high availability systems for commodity
market

Resource efficiency to avoid full duplication
3
Problem Statement

High availability currently a feature in high-end
servers
Employ specialized and proprietary
hardware/software
Trend towards more, but less reliable,
transistors
Design for availability important for a diverse
set of systems
Need high availability (HA) for mid-range systems

Borkar et al., 2004
4
Constraints (1 of 2)

General-purpose hardware
Off-the-shelf software
Availability on demand
Non-HA applications should not be penalized

Need to be commodity based and on-demand
5
Constraints (2 of 2)

Need to handle soft errors
Fault Isolation to prevent correlated errors
Need to handle hard errors
Reconfiguration to salvage fault free components

Need to provide full coverage
6
Adapting Commodity Systems: Challenges

Hardware
Commodity hardware designed with shared resources
Engineering issues in using available redundancy
in CMPs
Full duplication not cost-effective
Software
Time frame for commodity OS changes is too long
Changes need to be ported to various OSes

7
Prior approaches not applicable

NonStop, Stratus: redundant chips, I/O comparison
Expensive, especially with future CMPs
System-software-managed redundancy problematic
Intrusive changes required to OS (NonStop kernel, VOS)
Restricted to applications targeted to a particular niche OS
zSeries: redundant pipelines, instruction comparison
Tight lock-stepping not scalable
Latency of comparison, local error correction, etc.
Imposes constant overhead for all applications
Requires custom hardware design

8
Key Contributions
Today's talk
Configurable Isolation: transient fault detection, reconfiguration, commodity CMP
Transparent Redundancy: VMM-based redundancy, off-the-shelf OS, low overhead
ISCA, SELSE 2007; IEEE Computer 2007; 2 patents filed 2007
ASPLOS 2008 (poster); SELSE 2008; HP Tech Report 2008
Resource Efficiency: duplication cache, selective availability, adaptive availability
Power Efficiency: dynamic reprovisioning, power-efficient checkpoints, DRAM speculation
SELSE, ISCA 2007; HPCA 2008; 2 patents filed 2007, 2008
SELSE 2008; 2 patents filed 2008
9
Research Components
Configurable Isolation
10
Research Focus

Loose-lockstepped systems (NonStop, Stratus)
Tight lock-stepping (zSeries) is becoming harder
Easily extensible to commodity hardware
I/O level comparison
No intrusive changes inside the chip
Comparison bandwidth required is low

11
Conventional Multi-Core System: Shared

Commodity hardware
Emphasis on shared resources
No fault isolation
Correlated errors can go undetected even with TMR
cores
No reconfiguration support
One failure can be catastrophic

12
Design Alternatives: Full Isolation

Non-commodity design
Static partitioning of resources
Fault isolation per slice
No correlated errors
Limited reconfiguration
Only at slice level

13
Configurable Isolation

Techniques that provide optional logical fault
isolation for shared components
Addresses tension between fully shared and fully
isolated
High performance mode
CMP resources shared for maximum utilization
High availability mode
Degree of sharing configurable for selective
isolation

14
High Availability Mode

Split resources into domains
Map domains to different colors (units of fault containment)
Configurable number of colors
Each cache bank can store any line (subject to configuration)

15
Transient Fault Isolation

Loose lock-step design
Resources split in two domains
Loose lock-step software stack
Voters in I/O hub or hypervisor
Enables isolation for transient fault detection

16
Reconfiguration

Map out faulty components

20
Enhanced Commodity CMPs

Enable fault isolation
Ring and bank addressing
Cross links and input multiplexers
Ring configuration unit (RCU)
Self checked logic
Enable reconfiguration
Mode and tag bits to enable caching lines from any bank
Extra tag bits required: log2(number of banks), e.g., 3 extra bits per 512-bit line for 8 banks
Expected overhead minor (area, timing and power)
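
The tag-bit arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and rounding up for bank counts that are not powers of two is an assumption the slide does not state.

```python
def extra_tag_bits(num_banks: int) -> int:
    """Bits needed to distinguish num_banks banks, i.e. ceil(log2(num_banks))."""
    if num_banks < 1:
        raise ValueError("num_banks must be positive")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (num_banks - 1).bit_length()


def tag_overhead(num_banks: int, line_bits: int = 512) -> float:
    """Extra tag bits as a fraction of the cache line size."""
    return extra_tag_bits(num_banks) / line_bits


print(extra_tag_bits(8))           # 3 extra bits for 8 banks
print(f"{tag_overhead(8):.2%}")    # about 0.59% of a 512-bit line
```

For the slide's example of 8 banks and 512-bit lines this gives 3 bits, which is under 1% of the line and consistent with the "minor overhead" expectation.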

21
Evaluation Methodology

Impact of hard faults
Simulate system compute capacity over 100,000 hours
3 systems: shared, full isolation, configurable isolation
Three workloads (based on cache requirements)
Large memory: gap, apsi, swim, mcf
Mixed memory: applu, perlbmk, fma3d, crafty
Small memory: vortex, equake, facerec, mesa
State-of-the-art industrial fault model (from HP)
FIT rates and distributions for various components

22
Measure Impact of Faults

Effect of faults on the three architectures
Conventional: faults lead to loss of entire system
Full Isolation: faults lead to loss of slice
Configurable Isolation: faults lead to reconfiguration
Performance degradation based on type of fault
Architecture experiences a sequence of faults
Reconfiguration due to each fault
New configuration's throughput assigned until next fault

23
Two Phase Simulation Methodology

1st phase: exhaustive simulation of all configurations
Determines each configuration's throughput
Full-system x86 simulation
2nd phase: Monte Carlo simulation
Determines fault sequences
Generate expected fault time per component
Inject faults during each run of Monte Carlo simulation
10,000 runs for 100,000 simulated hours each
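
The second phase can be illustrated with a small Python sketch. It is not the simulator used in the talk: the component names, mean fault times, the exponential fault-time distribution, and the `throughput_of` lookup (standing in for the phase-1 results) are all assumptions made for illustration.

```python
import random


def simulate_run(mean_hours_to_fault, throughput_of, horizon=100_000.0, rng=random):
    """Return the time-averaged throughput of one Monte Carlo run.

    mean_hours_to_fault: component name -> mean time to fault (hypothetical values).
    throughput_of: maps the frozenset of failed components to the throughput
                   that phase 1 measured for that configuration.
    """
    # Draw one fault time per component (exponential here, purely as an assumption).
    faults = sorted(
        (rng.expovariate(1.0 / mean), name)
        for name, mean in mean_hours_to_fault.items()
    )
    failed = frozenset()
    t_prev = 0.0
    work = 0.0
    for t, name in faults:
        if t >= horizon:
            break
        # The current configuration's throughput applies until the next fault.
        work += throughput_of(failed) * (t - t_prev)
        failed = failed | {name}
        t_prev = t
    work += throughput_of(failed) * (horizon - t_prev)
    return work / horizon


def expected_throughput(mean_hours_to_fault, throughput_of, runs=10_000, seed=0):
    """Average simulate_run over many independent runs."""
    rng = random.Random(seed)
    total = sum(
        simulate_run(mean_hours_to_fault, throughput_of, rng=rng) for _ in range(runs)
    )
    return total / runs


if __name__ == "__main__":
    means = {"core0": 50_000.0, "core1": 50_000.0, "bank0": 200_000.0}
    # Hypothetical phase-1 table: each failed component costs 25% of throughput.
    print(expected_throughput(means, lambda failed: max(0.0, 1.0 - 0.25 * len(failed))))
```

A shared design would be modeled by a `throughput_of` that drops to zero after the first fault, while a configurable-isolation design degrades step by step.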

24
Results (Mixed Memory Workload)
[Figure: results chart for the mixed-memory workload; labeled values 29% and 33%]
25
Configurable Isolation Summary

Configurable Isolation
Enables transient fault detection
Enables reconfiguration
Requires non-intrusive changes

26
Research Components
Transparent Redundancy
27
System software based redundancy

Specialized OS: NonStop kernel, Stratus VOS
Create process pairs; use middleware targeted to OS
Manage replicas as lock-stepped state machines
Same inputs at same time → same outputs unless error
Sources of input non-determinism (single
threaded)
Non-deterministic instructions
Force same output at replicas
Asynchronous inputs
Force delivery at same instruction in replicas

Specialized functionality with intrusive OS
changes
28
Virtualization can help

VMM operation fundamentally based on
Ability to identify relevant events
Transform events to hide non-native execution

VMM can create and synchronize replicas
transparently
29
Transparent Redundancy Operation

Initialization: VMM boot-up
VMM replicates high availability VMs
Replicas execute deterministic instructions
natively
Transfer to VMM for events that need
synchronization
Periodic control transfer to VMM
Instruction decrement counter
Timer interrupt
Hypervisor-based fault tolerance (fail-stop hardware)
CMP-based approach with focus on soft error detection

Bressoud et al., ACM Transactions on Computer Systems, 1996
30
Transparent Redundancy Operation

[Timeline figure, built up across slides 30-33. R = replica, IPI = inter-processor interrupt, ND = non-deterministic, V = voter]
Start R (both replicas)
Privileged ND instruction → send result to R (IPI); the other replica applies the buffered result from R
User ND instruction → trap, send result to R (IPI); the other replica traps and applies the buffered result from R
Asynchronous interrupt → handshake (PC?) between replicas; both handle the interrupt
I/O → one replica waits; both vote (V)
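
The result-forwarding and voting steps in this timeline can be mimicked in a few lines of Python. This is a toy model under stated assumptions: the class and method names (`ReplicaPair`, `leader_nd`, `follower_nd`, `vote`) are hypothetical, a queue stands in for IPI delivery, and the "leader"/"follower" roles are labels chosen here, not terms from the slides.

```python
from collections import deque


class ReplicaPair:
    """Toy model of two replicas that must observe identical ND results."""

    def __init__(self):
        # Stands in for results delivered to the other replica via IPI.
        self.buffer = deque()

    def leader_nd(self, op):
        """First replica executes the non-deterministic operation and forwards the result."""
        result = op()
        self.buffer.append(result)
        return result

    def follower_nd(self):
        """Second replica does not re-execute; it applies the buffered result."""
        return self.buffer.popleft()


def vote(out_a, out_b):
    """Compare replica outputs at the I/O boundary; a mismatch signals an error."""
    if out_a != out_b:
        raise RuntimeError("miscomparison: replica outputs differ")
    return out_a


if __name__ == "__main__":
    import time

    pair = ReplicaPair()
    t_leader = pair.leader_nd(time.time)   # e.g., reading a clock is non-deterministic
    t_follower = pair.follower_nd()        # same value, so execution stays in step
    print(vote(t_leader, t_follower) == t_leader)
```

Because the second replica reuses the forwarded value, both replicas compute on identical inputs, so any difference seen by the voter points to an error rather than to legitimate non-determinism.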
34
Transparent Redundancy Operation
35
Evaluation Methodology

Analytical model + calibration experiments

36
Model parameters

VMware's Replay log to calculate event frequency
Replay and Retrace software in VMware Workstation
Logs all non-deterministic events and outcomes
Uses log to deterministically replay a VM later
Experiments using Xen and Linux
Measure overhead of event handling
Validate key functionality

VMware academic partnership
37
Synchronization Overhead
Synchronization overheads are small: 3-14%
38
VMM based transparent availability

VMM-based transparent availability promising
Synchronization overheads are small (3-14%)
Can support recovery from hard errors in software
Recovery from soft errors requires checkpoints
Checkpoint overhead in software is large: Xen, VMware
Hardware checkpoints can help (6%): ReVive I/O

Nakano et al., HPCA 2006
39
Summary

High availability increasingly important
Configurable Isolation
Isolation to enable effectively 100% soft error detection
Ability to reconfigure after hard faults
Non-intrusive changes to general purpose hardware
Transparent Redundancy
Transparent availability for off-the-shelf
software
Selective redundancy for high availability
applications
Resource efficiency important to avoid full
duplication

40
Questions

Thanks!

41
Bonus Slides
42
Transparent Redundancy Summary

Transparent Redundancy
Can enable availability for off-the-shelf OS
Small synchronization overhead

43
HA system with commodity blocks
VM
Replica VM
Applications
Applications
OS
OS
VMM
Replica Synchronization
Resources are still duplicated!
44
Duplication Overheads

Core and logic duplication
Can be used selectively in VMM based systems
Redundancy can be selected on a per VM basis
Expensive, but Moore's law helps
Memory Duplication
Memory cost in servers is significant
Loose lock-stepped systems fully duplicate memory
Memory hierarchy not shared across boards

45
Reducing memory duplication overheads

Shared resources in CMPs and VMM memory
management present an opportunity
If we can still ensure fault detection and
recovery
Key insights
Computational errors only propagate to written
pages
Read only pages are susceptible to memory errors
ECC can help detect and correct memory errors
Written pages are a small fraction of memory
footprint

Duplicate pages that are written, share read only
pages
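
The saving from sharing read-only pages is simple arithmetic, sketched below in Python. The function name and the page counts are hypothetical; only the rule (duplicate written pages, share read-only pages) comes from the slide.

```python
def memory_pages_needed(total_pages: int, written_pages: int, replicas: int = 2) -> int:
    """Pages needed when read-only pages are shared and written pages are duplicated."""
    if not 0 <= written_pages <= total_pages:
        raise ValueError("written_pages must be between 0 and total_pages")
    read_only = total_pages - written_pages
    return read_only + replicas * written_pages


# Hypothetical footprint: 1000 pages, of which 100 are written.
print(memory_pages_needed(1000, 100))    # 1100 pages with partial duplication
print(memory_pages_needed(1000, 1000))   # 2000 pages, same as full duplication
```

The smaller the written fraction, the closer the footprint stays to that of a single, non-redundant VM.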
46
Partial duplication
[Figure legend: red = replica, green = replica, blue = shared]

Loose lockstepped CMPs can have partial
duplication

Minor RCU modification to enable sharing of read
only requests
47
Duplication Cache

Number of written pages varies by application
Worst-case design would need 100% duplication
Software-managed page cache: Write Working Set
Initially mark pages read-only
Dynamically duplicate on write
Manage cache replacement as Least Recently Written
Pages replaced from cache are compared with original
If match: discard replica, mark original read-only
If error: start recovery
Small performance impact
page comparisons
soft page faults
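
The duplicate-on-write and compare-on-replacement policy can be sketched as a small Python class. This is an illustrative model, not the system described in the talk: `DuplicationCache`, its method names, and the idea of passing each replica's written value explicitly (`val_a`, `val_b`) are assumptions, and capacity is assumed to be at least 1.

```python
from collections import OrderedDict


class DuplicationCache:
    """Keeps private duplicates only for recently written pages."""

    def __init__(self, memory, capacity):
        self.memory = memory        # page id -> bytearray; shared while read-only
        self.capacity = capacity    # maximum number of duplicated pages (>= 1)
        self.dups = OrderedDict()   # written pages, least recently written first

    def write(self, page, offset, val_a, val_b):
        """Apply one replica's write to the original and the other's to the duplicate."""
        if page not in self.dups:
            if len(self.dups) >= self.capacity:
                self.evict()
            # First write to a read-only page: duplicate it.
            self.dups[page] = bytearray(self.memory[page])
        self.memory[page][offset] = val_a
        self.dups[page][offset] = val_b
        self.dups.move_to_end(page)  # now the most recently written page

    def evict(self):
        """Replace the least recently written page, comparing it with the original."""
        page, dup = self.dups.popitem(last=False)
        if dup != self.memory[page]:
            raise RuntimeError(f"page {page} miscompare: start recovery")
        # Match: the duplicate is discarded and the original is read-only again.
        return page


if __name__ == "__main__":
    mem = {p: bytearray(4) for p in range(3)}
    cache = DuplicationCache(mem, capacity=2)
    cache.write(0, 0, 1, 1)
    cache.write(1, 0, 2, 2)
    cache.write(2, 0, 5, 5)          # replaces page 0 after a successful compare
    print(list(cache.dups))          # [1, 2]
```

The comparison cost is paid only on replacement, which is why the performance impact comes down to page comparisons and soft page faults.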

48
IPC degradation vs. Memory Duplication
Workloads constructed as combination of SPEC
benchmarks
49
Summary

High availability increasingly important
Configurable Isolation
Isolation to enable effectively 100% soft error detection
Ability to reconfigure after hard faults
Non-intrusive changes to general purpose hardware
Transparent Redundancy
Transparent availability for off-the-shelf
software
Selective redundancy for high availability
applications
Resource efficiency to avoid full duplication

50
Future Directions

Hardware
Reconfiguration policies: topology, workload requirements
Intelligent use of heterogeneity
Software
Detailed evaluation of transparent redundancy
Multithreaded redundancy

51
Research Interests

Efficient Data Center Design
Conflicting constraints
Power
Performance
Reliability, Availability
Quality of Service, Service Level Agreements
Manageability
Conflicting solutions: hardware and multiple software levels
Which level is best suited?
Can virtual machines help?
Enable adaptive, transparent and integrated
resource management

52
Hypervisor based fault tolerance

Relies on fail stop hardware
Based on conventional SMP not CMP
Overheads extremely high due to Ethernet controller
Only considered hard errors
No voting or synchronization to detect soft errors
Static epoch-based protocol
Epoch length determines interrupt handling latency
No selective availability
Full duplication of memory

Bressoud et al., ACM Transactions on Computer Systems, 1996
53
Support required for transparent redundancy

Instruction decrement counter (readable/writeable)
Trap on user non-deterministic instructions
Configurable isolation

54
Replay

Replay and Retrace software in Workstation
Logs all non-deterministic events and outcomes
Uses log to deterministically replay a VM later

55
Other research

Memory System Design
Fair Queuing Memory Systems (MICRO 2006)
Power Efficient DRAM speculation (HPCA 2008)
Intrinsically Compatible Process VMs (UW-TR 2006)

56
COVERT Operation
Applications
Operating System
Deliver Interrupt
Asynchronous interrupt
VMM 1
Interrupt synchronization module
Send instruction counter
Ack / Instruction counter
Replica VM
57
COVERT Operation
Applications
Operating System
Translated OS call
OS response
Nondeterministic OS call
VMM
OS call emulation module
Translated OS response
Replica VM
58
COVERT Operation
Applications
Operating System
Checkpoint system state
Checkpoint granularity
VMM
Checkpoint module
Replica VM
59
COVERT Operation
Applications
Operating System
Reconfigure and restart from checkpoint
Miscomparison
VMM 1
Recovery module
Permanent fault
Replica VM
60
COVERT Operation
Applications
Operating System
Restart from checkpoint
Miscomparison
VMM 1
Recovery module
Replica VM
61
Fault data

Weibull distribution
Decreasing hazard rate
Inverse function
X = λ (-ln U)^(1/β)
Different data for hard and soft errors
Ratio of FIT rates between components important
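
Sampling fault times from a Weibull distribution by inverting its CDF can be sketched in Python as follows. The parameter names `scale` and `shape` and the example values are assumptions; the talk's actual FIT data is not reproduced here.

```python
import math
import random


def weibull_fault_time(scale, shape, rng=random):
    """Draw one fault time by inverse-transform sampling.

    A shape parameter below 1 gives the decreasing hazard rate noted on the slide.
    """
    u = 1.0 - rng.random()   # uniform in (0, 1], which avoids log(0)
    return scale * (-math.log(u)) ** (1.0 / shape)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical parameters: scale of 1e6 hours, shape of 0.7.
    samples = [weibull_fault_time(1e6, 0.7, rng) for _ in range(5)]
    print(samples)
```

Python's standard library also offers `random.weibullvariate(alpha, beta)`, which draws from the same family; the explicit version is shown only to make the inverse function visible.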

62
Fault types

Faults handled
Transient fault in core logic
Transient fault in system component logic
Permanent faults in components
Caused by wear out or manufacturing defect
Latent defects
Handle stuck at, bridging, open, delay
Faults not handled
Power delivery faults
Open circuit defects or shorts that can affect
entire chip
Rarer because they require much higher
temperature conditions

63
Why SPEC?

Single threaded
Diverse memory characteristics
Not speeding up the workload
Isolate performance difference from architecture
False positives in multithreaded execution
NonStop doesn't use multithreaded applications

64
Simulation parameters
65
Component Replacements (Mixed Memory)
[Figure: normalized component replacements (0.0 to 1.0) for Shared, Full Isolation, and Configurable Isolation at 10%, 25%, and 50% degradation]
66
Component Replacements
67
Dynamic Power Reprovisioning

Configurable Isolation enables new optimizations
Dynamic power reprovisioning
Suitable for systems with chip wide thermal
budget
Reassign power allotment of deconfigured
components
Constraints
Voltage supply
Thermal budgets
10% extra performance over configurable isolation

68
Systems with only hard error coverage
69
FIT Rate 10X
70
Component results - Shared
71
Component results - Configurable
72
Power Reprovisioning
73
Summary of current CMP availability features

Cores
Soft error detection limited to register files
with ECC
No fault isolation in Opteron, Xeon and Niagara
Limited isolation in Montecito and Power5 in
electrical and logical partitions
All architectures susceptible to soft errors in
logic
Montecito in lock-step configuration is an
exception

74
Summary of current CMP availability features

Caches
All architectures share at least one level of
cache
Resilient to errors in array
ECC or parity checks at all cache levels
No tolerance to multi-bit errors in Opteron and
Xeon
Niagara, Power5, and Montecito handle some classes of multi-bit errors
No tolerance to errors in cache circuitry or
interconnect
Entire socket susceptible to transient fault in
cache controller state machine

75
HP NonStop
76
zSeries
77
Cache
78
Core
79
Summary of current CMP availability features

Memory
Most fault tolerant resource
All conditions for high availability typically
satisfied in arrays
Sophisticated techniques present in all
architectures
Chip kill, background scrubbing, DIMM sparing
Memory access control circuitry unprotected
Some architectures better than others
Shared Northbridge more vulnerable than on-chip
controllers

80
Availability and Reliability

Availability = MTBF / (MTBF + MTTR)
Probability that system is operating when required
Reliability: MTTF
Probability that components will work for a period of time within some confidence interval
Reliability: characteristic of device
Availability: characteristic of device + system usage scenarios + repair time
Serviceability: early diagnosis of errors to avoid downtime (IBM call for repair)
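
As a quick worked example of the availability definition (MTBF divided by the sum of MTBF and MTTR), the following Python sketch converts it into expected downtime per year. The MTBF and MTTR values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is operating: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime in a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical system: fails every 9,999 hours on average, repaired in 1 hour.
a = availability(9_999.0, 1.0)
print(f"{a:.4f}")                                 # 0.9999
print(f"{downtime_minutes_per_year(a):.1f} min")  # about 52.6 minutes per year
```

Halving the repair time improves availability about as much as doubling the MTBF, which is why repair time and serviceability appear alongside device reliability on this slide.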

81
Component Replacements
Component replacements reduced by 60-100%
82
Availability Sensitive Systems
83
Synchronization overhead (breakdown)
85
Memory Duplication Overheads