Using Fault Model Enforcement (FME) to Improve Availability

1
Using Fault Model Enforcement (FME) to Improve
Availability
  • EASY '02 Workshop
  • Kiran Nagaraja, Ricardo Bianchini,
  • Richard Martin, Thu Nguyen
  • Department of Computer Science
  • Rutgers University

2
Motivation
  • Network services are extremely complex
  • Typically many software and hardware components
  • Numerous fault points and types
  • E.g., nodes, disks, cables, links, switches, etc.
  • Extremely difficult for services to tolerate all
    these faults
  • Hard to reason about all possible faults
  • Difficult to determine actual fault
  • Many faults exhibit same runtime symptoms

3
FME Approach
  • Define a reduced abstract fault model
  • Components, faults, symptoms, component behavior
    during faults
  • Enforce this fault model at run-time
  • If an unexpected fault occurs, map to one that
    was planned for in the abstract model
  • "If the facts don't fit the theory, change the
    facts." - Albert Einstein
  • Allow designer to concentrate on tolerating a
    well-defined, yet limited in complexity, set of
    faults

4
Our Study
  • Estimate potential impact of FME
  • Have not yet implemented FME
  • Case study: PRESS cluster-based web server
  • PRESS has a simple abstract fault model
  • In a companion study, PRESS achieved only around three 9s
    of availability
  • Study the hypothetical improvement if FME were used to
    enforce PRESS's abstract fault model
  • FME can reduce the unavailability by up to 50%

5
Outline
  • FME in more detail
  • Evaluation methodology
  • PRESS web server
  • Availability study
  • Related work
  • Conclusions
  • Future directions

6
Fault Model Enforcement (FME)
  • Enforce a reduced fault model at runtime
  • Allow service to perform correct recovery action
    to regain full functionality
  • How to enforce a reduced fault model?
  • Two ideas so far
  • Map an unexpected fault to an expected fault
  • E.g., crash a node if the network link connecting
    it to the switch fails
  • Fail outer component if sub-component fails
  • E.g., crash a node if the disk fails
  • How is it different from fail-stop?
  • Allows reasoning about failures at a desired
    abstraction

7
Evaluation Methodology
  • Want to evaluate FME's potential impact
  • Two-phase methodology
  • Phase I - Single-fault injection analysis
  • Define and inject faults on a live system
  • Monitor system performance (throughput, T) and
    availability (A): the fraction of successful
    requests
  • Phase II - Use an analytical model to determine
    performability
  • Computes average availability and average
    throughput
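
The two Phase I metrics can be computed from a per-request log, roughly as in the sketch below. The record layout (timestamp, success flag), the function name, and the choice to count only successful requests toward throughput are assumptions for illustration; they are not taken from the study's tooling.

```python
from typing import Iterable, Tuple


def phase1_metrics(
    requests: Iterable[Tuple[float, bool]], duration_s: float
) -> Tuple[float, float]:
    """Return (throughput, availability) for one fault-injection run.

    requests:   hypothetical (timestamp, succeeded) records, one per request.
    duration_s: length of the observation window in seconds.
    """
    records = list(requests)
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if not records:
        raise ValueError("availability is undefined with no requests")

    succeeded = sum(1 for _, ok in records if ok)
    throughput = succeeded / duration_s       # successful requests per second (T)
    availability = succeeded / len(records)   # fraction of successful requests (A)
    return throughput, availability


if __name__ == "__main__":
    log = [(0.0, True), (0.5, True), (1.0, False), (1.5, True)]
    print(phase1_metrics(log, duration_s=2.0))
```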

8
Case Study: PRESS Web Server
  • Cluster-based, locality-conscious web server
  • Serve requests out of global memory pool
  • Exclusion from pool → lower performance
  • Simple fault model
  • Connection failure/lost heartbeats → node failure
  • Recovery through rejoin of new node
  • Several versions developed over time
  • TCP, VIA
  • Different fault detection mechanisms
  • Heartbeats for TCP
  • Connection breaks for VIA

9
Fault Set
  • Fault Load
  • Link down
  • Switch down
  • SCSI timeout
  • Node crash
  • Node freeze
  • Application crash
  • Application hang
  • All faults are modeled as fail-stop

10
PRESS with FME
  • Recovery upon fault model mismatch
  • Restart 0, 1 or all nodes?
  • FME approach: reboot the appropriate node after a
    fault and its recovery have occurred
  • Link down: reboot unreachable node
  • Switch down: reboot all nodes
  • Disk failure: reboot node with faulty disk
  • Node, application crash: do nothing
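
The enforcement policy on this slide amounts to a small lookup from observed fault to reboot action. The sketch below is one way to write it down; the `Fault` enum, the `nodes_to_reboot` function, and the node names are hypothetical, and FME itself had not been implemented at the time of the talk.

```python
from enum import Enum, auto
from typing import List


class Fault(Enum):
    LINK_DOWN = auto()
    SWITCH_DOWN = auto()
    DISK_FAILURE = auto()
    NODE_CRASH = auto()
    APP_CRASH = auto()


def nodes_to_reboot(fault: Fault, affected: str, cluster: List[str]) -> List[str]:
    """Map an observed fault to the nodes FME would reboot (per the slide)."""
    if fault is Fault.LINK_DOWN:
        return [affected]        # reboot the unreachable node
    if fault is Fault.SWITCH_DOWN:
        return list(cluster)     # reboot all nodes
    if fault is Fault.DISK_FAILURE:
        return [affected]        # reboot the node with the faulty disk
    # Node and application crashes already match the fault model: do nothing.
    return []


if __name__ == "__main__":
    cluster = ["n0", "n1", "n2", "n3"]  # hypothetical node names
    for fault in Fault:
        print(fault.name, nodes_to_reboot(fault, "n2", cluster))
```

Keeping the policy as a pure function separates the decision (which nodes) from the mechanism (how a reboot is triggered), which makes the mapping easy to test on its own.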

11
Single-Fault Experiments
  • Setup: 4-PC cluster running at 90% load
  • 3 versions: TCP, TCP-HB, VIA
  • Use results to evaluate impact of FME

12
Single Fault - Results
  • [Figure panels: Link Failure; Application Hang]

13
Modeling: Seven-Stage Model
  • Input: measured throughput and availability
  • Parameters: MTTF, MTTR, operator on-site time
  • Output: average availability, average throughput

14
Modeling Availability
  • Assumptions
  • Effects of faults are independent
  • Fault arrivals are exponential
  • Overall unavailability = Σ (unavailability of all
    faults)
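
Because fault effects are assumed independent, the per-fault unavailabilities simply add. The sketch below illustrates that summation under a deliberately simplified per-fault model: each fault class alternates between a fault-free period of mean length MTTF and a degraded period of fixed mean length. This two-state simplification, the `FaultClass` fields, and all numbers are assumptions for illustration; the talk's actual model has seven stages.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FaultClass:
    name: str
    mttf_s: float        # mean time to failure, seconds
    degraded_s: float    # mean time spent degraded per fault, seconds
    avail_during: float  # measured fraction of successful requests while degraded


def unavailability(f: FaultClass) -> float:
    """Unavailability contributed by one fault class (simplified two-state model)."""
    fraction_degraded = f.degraded_s / (f.mttf_s + f.degraded_s)
    return fraction_degraded * (1.0 - f.avail_during)


def overall_unavailability(faults: Iterable[FaultClass]) -> float:
    """Sum per-fault unavailabilities, relying on the independence assumption."""
    return sum(unavailability(f) for f in faults)


if __name__ == "__main__":
    faults = [  # hypothetical values, not measurements from the study
        FaultClass("link down", mttf_s=180 * 86400, degraded_s=300, avail_during=0.75),
        FaultClass("application hang", mttf_s=30 * 86400, degraded_s=300, avail_during=0.50),
    ]
    print(f"overall unavailability: {overall_unavailability(faults):.2e}")
```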

15
Modeling Results
  • Application fault rate: 1/month
  • Time to operator intervention: 5 minutes
  • Unavailability of TCP-HB reduced by 50%
  • VIA: 36% reduction

16
Modeling Results
  • Application fault rate: 1/day - unstable software
  • Time to operator intervention: 5 minutes
  • Unavailability of TCP-HB reduced by > 50%
  • VIA: 13% reduction

17
Related Work
  • Enforcing fail-stop
  • Tandem Non-Stop process pairs
  • Robust design with rigorous internal assertions
  • Fault detection and fail-over
  • HA-Linux
  • Reactive and proactive rejuvenation
  • Recursive restartability (ROC): Berkeley,
    Stanford
  • Software rejuvenation: Duke

18
Conclusion
  • FME allows for very simple fault models
  • FME can cut the unavailability by up to 50%
  • Fault detection mechanism is crucial for
    effectiveness
  • Benefits increase with fault coverage

19
FME - Future Directions
  • How extensive should the fault model be?
  • Determines programming complexity/effort
  • How to prevent FME from reducing availability?
  • Bugs within enforcement?
  • When to declare a symptom a fault?
  • FME reduces human intervention
  • Are humans better at deciding?
  • 8-23% of recovery procedures are botched [Brown
    2001]

20
Thank you.
  • http://www.panic-lab.rutgers.edu/Projects/vivo

21
Communication Architecture
  • All operations by main thread are non-blocking
  • Separate send, receive and multiple disk helper
    threads
  • Filling up of queues could stall the entire node

22
Performability
  • Model computes 2 metrics
  • Average throughput (AT)
  • Average availability (AA)
  • Performability
  • $P = T_n \times \dfrac{\log(\mathit{AI})}{\log(\mathit{AA})}$
  • AI: availability of an ideal system with 99.999%
    availability
  • Log-scale ratio allows a linear relationship with
    unavailability
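
The performability metric on this slide, P = Tn × log(AI) / log(AA), can be evaluated directly, as in the minimal sketch below. The function and parameter names are assumptions; the slide does not define Tn further, so it is treated here as a throughput value supplied by the caller.

```python
import math


def performability(t_n: float, aa: float, ai: float = 0.99999) -> float:
    """P = Tn * log(AI) / log(AA).

    t_n: throughput term (Tn on the slide), supplied by the caller.
    aa:  average availability of the measured system, strictly in (0, 1).
    ai:  availability of the ideal system (99.999% on the slide).
    """
    if not (0.0 < aa < 1.0) or not (0.0 < ai < 1.0):
        raise ValueError("availabilities must lie strictly between 0 and 1")
    return t_n * math.log(ai) / math.log(aa)


if __name__ == "__main__":
    # For availabilities near 1, -log(A) is close to the unavailability (1 - A),
    # so halving unavailability roughly doubles P at a fixed throughput.
    for aa in (0.998, 0.999, 0.9995):
        print(f"AA={aa}: P={performability(1000.0, aa):.3f}")
```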

23
Experiments: Single-Fault Loads
  • 4 × 800 MHz PIII PCs, 206 MB, 2 × 10000 SCSI disks,
    1 Gb/s cLan interconnect (TCP or VIA)
  • PRESS: 128 MB file cache, static content
  • Clients: constant rate at 90% of server capacity
  • Modified sclient [Banga 97]
  • Rutgers trace: file size, avg. request size

24
Mendosus Fault Injection

25
Phase II: Modeling Performability
  • 5-minute duration for operator intervention (E)
    and restart (F) stages