Designing High Availability Systems with Commodity Hardware and Software

1
Designing High Availability Systems with
Commodity Hardware and Software
  • Nidhi Aggarwal

2
Research Overview
  • Design high availability systems for commodity
    market

Resource efficiency to avoid full duplication
3
Problem Statement
  • High availability currently a feature in high-end
    servers
  • Employ specialized and proprietary
    hardware/software
  • Trend towards more, but less reliable,
    transistors
  • Design for availability important for a diverse
    set of systems
  • Need high availability (HA) for mid-range systems

Borkar et al., 2004
4
Constraints (1 of 2)
  • General-purpose hardware
  • Off-the-shelf software
  • Availability on demand
  • Non-HA applications should not be penalized

Need to be commodity based and on-demand
5
Constraints (2 of 2)
  • Need to handle soft errors
  • Fault Isolation to prevent correlated errors
  • Need to handle hard errors
  • Reconfiguration to salvage fault free components

Need to provide full coverage
6
Adapting Commodity Systems: Challenges
  • Hardware
  • Commodity hardware designed with shared resources
  • Engineering issues in using available redundancy
    in CMPs
  • Full duplication not cost-effective
  • Software
  • Commodity OS change timeframe too long
  • Changes need to be ported to various OSes

7
Prior approaches not applicable
  • NonStop, Stratus: redundant chips, I/O comparison
  • Expensive especially with future CMPs
  • System software managed redundancy problematic
  • Intrusive changes required to OS (NonStop kernel,
    VOS)
  • Restricted to applications targeted to particular
    niche OS
  • zSeries: redundant pipelines, instruction
    comparison
  • Tight lock-stepping not scalable
  • Latency of comparison, local error correction
    etc.
  • Imposes constant overhead for all applications
  • Requires custom hardware design

8
Key Contributions
Today's talk
  • Configurable Isolation
  • - Transient fault detection
  • - Reconfiguration
  • - Commodity CMP
  • Transparent Redundancy
  • - VMM based redundancy
  • - Off-the-shelf OS
  • - Low overhead

ISCA, SELSE 2007; IEEE Computer 2007; 2 patents filed 2007
ASPLOS 2008 (poster); SELSE 2008; HP Tech Report 2008
  • Resource Efficiency
  • - Duplication cache
  • - Selective availability
  • - Adaptive availability
  • Power Efficiency
  • - Dynamic reprovisioning
  • - Power-efficient checkpoints
  • - DRAM speculation

SELSE, ISCA 2007; HPCA 2008; 2 patents filed 2007, 2008
SELSE 2008; 2 patents filed 2008
9
Research Components
Configurable Isolation
10
Research Focus
  • Loose-lockstepped systems (NonStop, Stratus)
  • Tight lock-stepping (zSeries) is becoming harder
  • Easily extensible to commodity hardware
  • I/O level comparison
  • No intrusive changes inside the chip
  • Comparison bandwidth required is low

11
Conventional Multi-Core System: Shared
  • Commodity hardware
  • Emphasis on shared resources
  • No fault isolation
  • Correlated errors can go undetected even with TMR
    cores
  • No reconfiguration support
  • One failure can be catastrophic

12
Design Alternatives: Full Isolation
  • Non-commodity design
  • Static partitioning of resources
  • Fault isolation per slice
  • No correlated errors
  • Limited reconfiguration
  • Only at slice level

13
Configurable Isolation
  • Techniques that provide optional logical fault
    isolation for shared components
  • Addresses tension between fully shared and fully
    isolated
  • High performance mode
  • CMP resources shared for maximum utilization
  • High availability mode
  • Degree of sharing configurable for selective
    isolation

14
High Availability Mode
  • Split resources into domains
  • Map domains to different colors
  • Colors are units of fault containment
  • Configurable number of colors
  • Each cache bank can store any line (subject to
    configuration)

15
Transient Fault Isolation
  • Loose lock-step design
  • Resources split in two domains
  • Loose lock-step software stack
  • Voters in I/O hub or hypervisor
  • Enables isolation for transient fault detection

16
Reconfiguration
  • Map out faulty components

20
Enhanced Commodity CMPs
  • Enable fault isolation
  • Ring and bank addressing
  • Cross links and input multiplexers
  • Ring configuration unit (RCU)
  • Self checked logic
  • Enable reconfiguration
  • Mode and tag bits to enable caching lines from
    any bank
  • Extra tag bits required: log2(number of banks)
  • e.g., 3 extra bits per 512-bit line (for 8 banks)
  • Expected overhead minor (area, timing and power)
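
The tag-bit arithmetic on this slide can be checked with a short sketch. The function name `extra_tag_bits` is hypothetical, and the sketch assumes the extra bits only need to name one of the banks; the slides do not spell out the exact tag layout.

```python
def extra_tag_bits(num_banks: int) -> int:
    """Bits needed to name one of num_banks banks: ceil(log2(num_banks))."""
    if num_banks < 1:
        raise ValueError("num_banks must be positive")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1; one bank needs 0 bits.
    return (num_banks - 1).bit_length()


if __name__ == "__main__":
    bits = extra_tag_bits(8)      # the slide's example: 8 banks
    line_bits = 512               # the slide's example line size
    print(bits, "extra bits per line;", f"{bits / line_bits:.2%} of the line")
```

For the slide's 8-bank example this gives 3 bits per 512-bit line, which is consistent with the claim that the storage overhead is minor.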

21
Evaluation Methodology
  • Impact of hard faults
  • Simulate system compute capacity over 100,000
    hours
  • 3 systems: shared, full isolation, configurable
    isolation
  • Three workloads (based on cache requirements)
  • Large memory: gap, apsi, swim, mcf
  • Mixed memory: applu, perlbmk, fma3d, crafty
  • Small memory: vortex, equake, facerec, mesa
  • State-of-the-art industrial fault model (from HP)
  • FIT rates and distributions for various components

22
Measure Impact of Faults
  • Effect of faults on the three architectures
  • Conventional: faults lead to loss of entire
    system
  • Full Isolation: faults lead to loss of slice
  • Configurable Isolation: faults lead to
    reconfiguration
  • Performance degradation based on type of fault
  • Architecture experiences a sequence of faults
  • Reconfiguration due to each fault
  • New configuration's throughput assigned until next
    fault

23
Two Phase Simulation Methodology
  • 1st phase: exhaustive simulation of all
    configurations
  • Determines each configuration's throughput
  • Full system x86 simulation
  • 2nd phase: Monte Carlo simulation
  • Determines fault sequences
  • Generate expected fault time per component
  • Inject faults during each run of Monte Carlo
    simulation
  • 10,000 runs for 100,000 simulated hours each

24
Results (Mixed Memory Workload)
[Chart; surviving labels: 29, 33]
25
Configurable Isolation Summary
  • Configurable Isolation
  • Enables transient fault detection
  • Enables reconfiguration
  • Requires non-intrusive changes

26
Research Components
Transparent Redundancy
27
System software based redundancy
  • Specialized OS: NonStop kernel, Stratus VOS
  • Create process pairs; use middleware targeted to
    OS
  • Manage replicas as lock-stepped state machines
  • Same inputs at same time → same outputs unless
    error
  • Sources of input non-determinism (single
    threaded)
  • Non-deterministic instructions
  • Force same output at replicas
  • Asynchronous inputs
  • Force delivery at same instruction in replicas

Specialized functionality with intrusive OS
changes
28
Virtualization can help
  • VMM operation fundamentally based on
  • Ability to identify relevant events
  • Transform events to hide non-native execution

VMM can create and synchronize replicas
transparently
29
Transparent Redundancy Operation
  • Initialization: VMM boot up
  • VMM replicates high availability VMs
  • Replicas execute deterministic instructions
    natively
  • Transfer to VMM for events that need
    synchronization
  • Periodic control transfer to VMM
  • Instruction decrement counter
  • Timer interrupt
  • Hypervisor based fault tolerance (fail stop
    hardware)
  • CMP based approach with focus on soft error
    detection

Bressoud et al., ACM Transactions on Computer Systems, 1996
30
Transparent Redundancy Operation
IPI: inter-processor interrupt; ND: non-deterministic
Timeline, built up over slides 30-33 (one column per replica):
  • Both replicas: Start R
  • Privileged non-deterministic instruction → one replica
    sends the result to R (IPI); the other applies the
    buffered result from R
  • User ND instruction → trap; one replica sends the result
    to R (IPI); the other traps and applies the buffered
    result from R
  • Interrupt (asynchronous) → handshake (PC?) on both
    replicas, then both handle the interrupt
  • I/O → wait and vote (V)
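
The timeline can be illustrated with a toy model. This is a sketch under stated assumptions, not the VMM's implementation: a Python `deque` stands in for results forwarded over IPI, a random 32-bit value stands in for a non-deterministic instruction result, and `ReplicaPair`, `guest`, and `vote` are hypothetical names.

```python
import random
from collections import deque


class ReplicaPair:
    """Forwards non-deterministic (ND) results from one replica to the other."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._buffer = deque()  # stands in for results sent via IPI

    def leader_nd(self):
        # The leading replica executes the ND instruction and forwards the result.
        value = self._rng.getrandbits(32)
        self._buffer.append(value)
        return value

    def follower_nd(self):
        # The other replica does not re-execute; it applies the buffered result.
        return self._buffer.popleft()


def guest(read_nd, steps=4, flip_bit=False):
    """Deterministic guest computation apart from its ND reads."""
    state = 0
    for i in range(steps):
        state = (state * 31 + read_nd() + i) & 0xFFFFFFFF
    if flip_bit:
        state ^= 1  # models a transient fault in one replica
    return state


def vote(out_a, out_b):
    """I/O-level comparison: release output only if both replicas agree."""
    if out_a != out_b:
        raise RuntimeError("miscomparison: start recovery")
    return out_a


if __name__ == "__main__":
    pair = ReplicaPair()
    a = guest(pair.leader_nd)
    b = guest(pair.follower_nd)
    print("voted output released:", vote(a, b) == a)
```

Because the follower only consumes buffered values, it can lag behind the leader, which is the loose lock-step property the slides rely on; agreement is enforced only at the vote.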
34
Transparent Redundancy Operation
35
Evaluation Methodology
  • Analytical model plus calibration experiments

36
Model parameters
  • VMware's Replay log to calculate event frequency
  • Replay and Retrace software in VMware
    Workstation
  • Logs all non-deterministic events and outcomes
  • Uses log to deterministically replay a VM later
  • Experiments using Xen and Linux
  • Measure overhead of event handling
  • Validate key functionality

VMware academic partnership
37
Synchronization Overhead
Synchronization overheads are small: 3-14%
38
VMM based transparent availability
  • VMM based transparent availability promising
  • Synchronization overheads are small (3-14%)
  • Can support recovery from hard errors in software
  • Recovery from soft errors requires checkpoints
  • Checkpoint overhead in software is large: Xen,
    VMware
  • Hardware checkpoints can help (6%): ReVive I/O

Nakano et al., HPCA 2006
39
Summary
  • High availability increasingly important
  • Configurable Isolation
  • Isolation to enable effectively 100% soft error
    detection
  • Ability to reconfigure after hard faults
  • Non-intrusive changes to general purpose hardware
  • Transparent Redundancy
  • Transparent availability for off-the-shelf
    software
  • Selective redundancy for high availability
    applications
  • Resource efficiency important to avoid full
    duplication

40
Questions
  • Thanks!

41
Bonus Slides
42
Transparent Redundancy Summary
  • Transparent Redundancy
  • Can enable availability for off-the-shelf OS
  • Small synchronization overhead

43
HA system with commodity blocks
VM
Replica VM
Applications
Applications
OS
OS
VMM
Replica Synchronization
Resources are still duplicated!
44
Duplication Overheads
  • Core and logic duplication
  • Can be used selectively in VMM based systems
  • Redundancy can be selected on a per VM basis
  • Expensive, but Moore's law helps
  • Memory Duplication
  • Memory cost in servers is significant
  • Loose lock-stepped systems fully duplicate memory
  • Memory hierarchy not shared across boards

45
Reducing memory duplication overheads
  • Shared resources in CMPs and VMM memory
    management present an opportunity
  • If we can still ensure fault detection and
    recovery
  • Key insights
  • Computational errors only propagate to written
    pages
  • Read only pages are susceptible only to memory errors
  • ECC can help detect and correct memory errors
  • Written pages are a small fraction of memory
    footprint

Duplicate pages that are written, share read only
pages
46
Partial duplication
Legend: red = replica, green = replica, blue = shared
  • Loose lockstepped CMPs can have partial
    duplication

Minor RCU modification to enable sharing of read
only requests
47
Duplication Cache
  • Number of written pages varies by application
  • Worst case design would need 100% duplication
  • Software-managed page cache: write working set
  • Initially mark pages Read Only
  • Dynamically duplicate on Write
  • Manage cache replacement as Least Recently
    Written
  • Pages replaced from cache are compared with
    original
  • If match: discard replica, mark original read
    only
  • If error: start recovery
  • Small performance impact
  • page comparisons
  • soft page faults

48
IPC degradation vs. Memory Duplication
Workloads constructed as combination of SPEC
benchmarks
49
Summary
  • High availability increasingly important
  • Configurable Isolation
  • Isolation to enable effectively 100% soft error
    detection
  • Ability to reconfigure after hard faults
  • Non-intrusive changes to general purpose hardware
  • Transparent Redundancy
  • Transparent availability for off-the-shelf
    software
  • Selective redundancy for high availability
    applications
  • Resource efficiency to avoid full duplication

50
Future Directions
  • Hardware
  • Reconfiguration policies: topology, workload
    requirements
  • Intelligent use of heterogeneity
  • Software
  • Detailed evaluation of transparent redundancy
  • Multithreaded redundancy

51
Research Interests
  • Efficient Data Center Design
  • Conflicting constraints
  • Power
  • Performance
  • Reliability, Availability
  • Quality of Service, Service Level Agreements
  • Manageability
  • Conflicting solutions: hardware and multiple
    software levels
  • Which level is best suited?
  • Can virtual machines help?
  • Enable adaptive, transparent and integrated
    resource management

52
Hypervisor based fault tolerance
  • Relies on fail stop hardware
  • Based on conventional SMP not CMP
  • Overheads extremely high due to ethernet
    controller
  • Only considered hard errors
  • No voting or synchronization to detect soft
    errors
  • Static epoch based protocol
  • Epoch length determines interrupt handling
    latency
  • No selective availability
  • Full duplication of memory

Bressoud et al., ACM Transactions on Computer Systems, 1996
53
Support required for transparent redundancy
  • Instruction decrement counter (readable/writeable)
  • Trap on user non-deterministic instructions
  • Configurable isolation

54
Replay
  • Replay and Retrace software in Workstation
  • Logs all non-deterministic events and outcomes
  • Uses log to deterministically replay a VM later

55
Other research
  • Memory System Design
  • Fair Queuing Memory Systems (MICRO 2006)
  • Power Efficient DRAM speculation (HPCA 2008)
  • Intrinsically Compatible Process VMs (UW-TR 2006)

56
COVERT Operation
Applications
Operating System
Deliver Interrupt
Asynchronous interrupt
VMM 1
Interrupt synchronization module
Send instruction counter
Ack / Instruction counter
Replica VM
57
COVERT Operation
Applications
Operating System
Translated OS call
OS response
Nondeterministic OS call
VMM
OS call emulation module
Translated OS response
Replica VM
58
COVERT Operation
Applications
Operating System
Checkpoint system state
Checkpoint granularity
VMM
Checkpoint module
Replica VM
59
COVERT Operation
Applications
Operating System
Reconfigure and restart from checkpoint
Miscomparison
VMM 1
Recovery module
Permanent fault
Replica VM
60
COVERT Operation
Applications
Operating System
Restart from checkpoint
Miscomparison
VMM 1
Recovery module
Replica VM
61
Fault data
  • Weibull distribution
  • Decreasing hazard rate
  • Inverse function
  • X = λ(-ln(U))^(1/β), with scale λ, shape β, and U
    uniform on (0, 1)
  • Different data for hard and soft errors
  • Ratio of FIT rates between components important

62
Fault types
  • Faults handled
  • Transient fault in core logic
  • Transient fault in system component logic
  • Permanent faults in components
  • Caused by wear out or manufacturing defect
  • Latent defects
  • Handle stuck at, bridging, open, delay
  • Faults not handled
  • Power delivery faults
  • Open circuit defects or shorts that can affect
    entire chip
  • Rarer because they require much higher
    temperature conditions

63
Why SPEC?
  • Single threaded
  • Diverse memory characteristics
  • Not speeding up the workload
  • Isolate performance difference from architecture
  • False positives in multithreaded execution
  • NonStop doesn't use multithreaded applications

64
Simulation parameters
65
Component Replacements (Mixed Memory)
[Chart: normalized component replacements (0.0 to 1.0) for Shared, Full Isolation, and Configurable Isolation at 10%, 25%, and 50% degradation]
66
Component Replacements
67
Dynamic Power Reprovisioning
  • Configurable Isolation enables new optimizations
  • Dynamic power reprovisioning
  • Suitable for systems with chip wide thermal
    budget
  • Reassign power allotment of deconfigured
    components
  • Constraints
  • Voltage supply
  • Thermal budgets
  • 10% extra performance over configurable isolation

68
Systems with only hard error coverage
69
FIT Rate 10X
70
Component results - Shared
71
Component results - Configurable
72
Power Reprovisioning
73
Summary of current CMP availability features
  • Cores
  • Soft error detection limited to register files
    with ECC
  • No fault isolation in Opteron, Xeon and Niagara
  • Limited isolation in Montecito and Power5 in
    electrical and logical partitions
  • All architectures susceptible to soft errors in
    logic
  • Montecito in lock-step configuration is an
    exception

74
Summary of current CMP availability features
  • Caches
  • All architectures share at least one level of
    cache
  • Resilient to errors in array
  • ECC or parity checks at all cache levels
  • No tolerance to multi-bit errors in Opteron and
    Xeon
  • Niagara, Power5, and Montecito handle some classes
    of multi-bit errors
  • No tolerance to errors in cache circuitry or
    interconnect
  • Entire socket susceptible to transient fault in
    cache controller state machine

75
HP NonStop
76
zSeries
77
Cache
78
Core
79
Summary of current CMP availability features
  • Memory
  • Most fault tolerant resource
  • All conditions for high availability typically
    satisfied in arrays
  • Sophisticated techniques present in all
    architectures
  • Chip kill, background scrubbing, DIMM sparing
  • Memory access control circuitry unprotected
  • Some architectures better than others
  • Shared Northbridge more vulnerable than on-chip
    controllers

80
Availability and Reliability
  • Availability = MTBF/(MTBF + MTTR)
  • Probability that system is operating when
    required
  • Reliability: MTTF
  • Probability that components will work for a
    period of time within some confidence interval
  • Reliability: characteristic of device
  • Availability: characteristic of device, system
    usage scenarios, and repair time
  • Serviceability: early diagnosis of errors to
    avoid downtime (e.g., IBM call for repair)
81
Component Replacements
Component replacements reduced by 60-100%
82
Availability Sensitive Systems
83
Synchronization overhead (breakdown)
85
Memory Duplication Overheads
  • Tightly lock-stepped systems don't duplicate
    memory
  • Loose lock-stepped systems fully duplicate memory

86
Partial duplication
87
Ring Configuration Unit (Duplication)
88
IBM Power 6 reconfiguration
  • Deallocate cores
  • Deallocate off chip L3 cache using on chip
    directory
  • L2 cache error requires deallocation of all cores
    associated with a cache bank
  • No deallocation of memory controllers etc.