Title: CS556: Distributed Systems
1. CS-556 Distributed Systems: Recovery-Oriented Computing
- Manolis Marazakis
- maraz@csd.uoc.gr
2. Dependability in the Internet Era (J. Gray, Microsoft Research, 2001)
Recovery-Oriented Computing (D. Patterson, UCB, 2002)
3. The Last 5 Years: Availability Dark Ages. Ready for a Renaissance?
- Things got better, then things got a lot worse!
[Chart: availability over time of telephone systems, cell phones, computer systems, and the Internet]
4. DEPENDABILITY: The 3 ITIES
- RELIABILITY / INTEGRITY: does the right thing. (Also MTTF >> 1)
- AVAILABILITY: does it now. (Also 1 >> MTTR)
- System Availability = MTTF / (MTTF + MTTR)
- If 90% of terminals are up & 99% of the DB is up => 89% of transactions are serviced on time (worked check below).
- Holistic vs. Reductionist view
[Diagram: Security, Integrity, Reliability, Availability]
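A quick worked check of the figures on this slide, assuming terminal and DB failures are independent:

    Availability = MTTF / (MTTF + MTTR)
    P(transaction serviced on time) ≈ P(terminal up) x P(DB up) = 0.90 x 0.99 = 0.891 ≈ 89%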
5. Fail-Fast is Good, Repair is Needed
- Lifecycle of a module: fail-fast gives short fault latency
- High Availability is low UN-Availability
- Unavailability ≈ MTTR / MTTF
- Improving either MTTR or MTTF gives benefit
- Simple redundancy does not help much.
6. Fault Model
- Failures are independent, so single-fault tolerance is a big win
- Hardware fails fast (dead disk, blue-screen)
- Software fails-fast (or goes to sleep)
- Software often repaired by reboot
- Heisenbugs
- Operations tasks: major source of outages
- Utility operations
- Software upgrades
7. Disks (RAID): the BIG Success Story
- Duplex or Parity masks faults (back-of-the-envelope below)
- Disks @ 1M hours (100 years)
- But:
- controllers fail, and
- systems have 1,000s of disks.
- Duplexing or parity, and dual path, gives perfect disks
- Wal-Mart never lost a byte (thousands of disks, hundreds of failures).
- Only software/operations mistakes are left.
8. Fault Tolerance vs Disaster Tolerance
- Fault-Tolerance masks local faults
- RAID disks
- Uninterruptible Power Supplies
- Cluster Failover
- Disaster Tolerance masks site failures
- Protects against fire, flood, sabotage,..
- Redundant system and service at remote site.
9. Case Study - Japan: "Survey on Computer Security", Japan Info Dev Corp., March 1986 (trans. Eiichi Watanabe).
[Pie chart: share of outages by cause - Vendor, Telecomm lines, Environment, Application Software, Operations]
- Vendor (hardware and software): 5 months
- Application software: 9 months
- Communications lines: 1.5 years
- Operations: 2 years
- Environment: 2 years
- Overall MTTF: 10 weeks
- 1,383 institutions reported (6/84 - 7/85)
- 7,517 outages, MTTF ≈ 10 weeks, average duration ≈ 90 minutes
- To get a 10-year MTTF, must attack all these areas
10. Case Studies - Tandem Trends
- MTTF improved
- Shift in causes: from Hardware & Maintenance (down from 50% to 10%) to Software (62%) & Operations (15%)
- NOTE: systematic under-reporting of Environment, Operations errors, and Application Software
11. Dependability Status circa 1995
- 4-year MTTF => 5 9's for well-managed systems. Fault Tolerance Works.
- Hardware is GREAT (maintenance and MTTF).
- Software masks most hardware faults.
- Many hidden software outages in operations:
- New software.
- Utilities.
- Must make all hardware/software changes ONLINE.
- Software seems to define a 30-year MTTF ceiling.
- Reasonable goal: 100-year MTTF; class 4 today => class 6 tomorrow.
12. What's Happened Since Then?
- Hardware got better
- Software got better (even though it is more complex)
- RAID is standard, snapshots becoming standard
- Cluster-in-a-box commodity failover
- Remote replication is standard.
13. Availability
- Un-managed
- Well-managed nodes: masks some hardware failures
- Well-managed packs & clones: masks hardware failures, operations tasks (e.g. software upgrades), and some software failures
- Well-managed GeoPlex: masks site failures (power, network, fire, move, ...) and some operations failures
14. Progress?
- MTTF improved from 1950-1995
- MTTR has not improved much since 1970 (failover)
- Hardware and software online change (pNp) is now standard
- Then the Internet arrived:
- No project can take more than 3 months.
- Time to market is everything.
- Change is good.
15. The Internet Changed Expectations
- 1990:
- Phones delivered 99.999%
- ATMs delivered 99.99%
- Failures were front-page news.
- Few hackers
- Outages last an hour
- 2000:
- Cellphones deliver 90%
- Web sites deliver 98%
- Failures are business-page news
- Many hackers.
- Outages last a day
This is progress?
16. Why? (1) Complexity
- Internet sites are MUCH more complex.
- NAP
- Firewall / proxy / IP sprayer
- Web
- DMZ
- App server
- DB server
- Links to other sites
- tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os
- Skill level is much reduced
17. A Schematic of HotMail
- 7,000 servers
- 100 backend stores with 120 TB (cooked)
- 3 data centers
- Links to:
- Passport
- Ad-rotator
- Internet mail gateways
- ...
- 1B messages per day
- 150M mailboxes, 100M active
- 400,000 new per day.
18. Why? (2) Velocity
- No project can take more than 13 weeks.
- Time to market is everything.
- Functionality is everything.
- Faster, cheaper, badder?
19. Why? (3) Hackers
- Hackers are a new & increased threat
- Any site can be attacked from anywhere
- Motives include ego, malice, and greed.
- Complexity makes it hard to protect sites.
- Concentration of wealth makes sites an attractive target.
- "Why did you rob banks?" Willie Sutton: "'Cause that's where the money is!"
Note: Eric Raymond's "How to Become a Hacker" (http://www.tuxedo.org/esr/faqs/hacker-howto.html) describes the positive use of the term; here I mean malicious and anti-social hackers.
20. How Bad Is It?
- http://www-iepm.slac.stanford.edu/pinger/
- Median monthly ping packet loss for 2/99
21. Microsoft.Com
- Operations mis-configured a router
- Took a day to diagnose and repair.
- DOS attacks cost a fraction of a day.
- Regular security patches.
22. Back-End Servers are More Stable
- Generally deliver 99.99%
- TerraServer, for example: single back-end failed after 2.5 years.
- Went to a 4-node cluster.
- Fails every 2 months. Transparent failover in 30 sec. Online software upgrades. So 99.999% in the backend (arithmetic check below).
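A quick arithmetic check of that five-nines claim: one failure every two months with a 30-second transparent failover gives

    Unavailability ≈ 30 sec / (2 months ≈ 5.2 x 10^6 sec) ≈ 5.8 x 10^-6
    Availability ≈ 99.9994%, i.e. about five nines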
23. eBay: A very honest site
- http://www2.ebay.com/aw/announce.shtml#top
- Publishes its operations log.
- Has 99% of scheduled uptime
- Schedules about 2 hours/week of downtime.
- Has had some operations outages
- Has had some DOS problems.
24. Not to throw stones, but...
- Everyone has a serious problem.
- The BEST people publish their stats.
- The others HIDE their stats (check Netcraft to see who I mean).
- We have good NODE-level availability: 5 9's is reasonable.
- We have TERRIBLE system-level availability: 2 9's is the goal.
25. Recommendation 1
- Continue progress on back-ends.
- Make management easier (AUTOMATE IT!!!)
- Measure
- Compare best practices
- Continue to look for better algorithms.
- Live in fear
- We are at 10,000 node servers
- We are headed for 1,000,000 node servers
26. Recommendation 2
- Current security approach is unworkable
- Anonymous clients
- Firewall is clueless
- Incredible complexity
- We can't win this game!
- So change the rules (redefine the problem)
- No anonymity
- Unified authentication/authorization model
- Single-function devices (with simple interfaces)
- Only one kind of interface (uddi/wsdl/soap/...).
27. References
- Adams, E. (1984). "Optimizing Preventative Service of Software Products." IBM Journal of Research and Development, 28(1), 2-14.
- Anderson, T. and B. Randell (1979). Computing Systems Reliability.
- Garcia-Molina, H. and C. A. Polyzois (1990). "Issues in Disaster Recovery." 35th IEEE Compcon 90, 573-577.
- Gray, J. (1986). "Why Do Computers Stop and What Can We Do About It." 5th Symposium on Reliability in Distributed Software and Database Systems, 3-12.
- Gray, J. (1990). "A Census of Tandem System Availability between 1985 and 1990." IEEE Transactions on Reliability, 39(4), 409-418.
- Gray, J. N. and A. Reuter (1993). Transaction Processing: Concepts and Techniques. San Mateo, Morgan Kaufmann.
- Lampson, B. W. (1981). "Atomic Transactions." In Distributed Systems - Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
- Laprie, J. C. (1985). "Dependable Computing and Fault Tolerance: Concepts and Terminology." 15th FTCS, 2-11.
- Long, D. D., J. L. Carroll, and C. J. Park (1991). "A study of the reliability of Internet sites." Proc. 10th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.
- Long, D., A. Muir, and R. Golding (1995). "A Longitudinal Study of Internet Host Reliability." Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, IEEE, September 1995, pp. 2-9.
- http://www.netcraft.com/ - They have even better for-fee data as well, but the for-free data is really excellent.
- http://www2.ebay.com/aw/announce.shtml#top - eBay is an excellent benchmark of best Internet practices.
- http://www-iepm.slac.stanford.edu/pinger/ - Network traffic/quality report; dated, but the others have died off!
28. The real scalability problems: AME
- Availability
- systems should continue to meet quality-of-service goals despite hardware and software failures
- Maintainability
- systems should require only minimal ongoing human administration, regardless of scale or complexity; today, the cost of maintenance is 10X the cost of purchase
- Evolutionary Growth
- systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
- These are problems at today's scales, and will only get worse as systems grow
29. Total Cost of Ownership (IBM)
- Administration: all people time
- Backup & Restore: devices, media, and people time
- Environmental: floor space, power, air conditioning
30. Lessons learned from Past Projects which might help AME
- Know how to improve performance (and cost):
- Run the system against a workload, measure, innovate, repeat
- Benchmarks standardize workloads, lead to competition, evaluate alternatives; they turn debates into numbers
- Major improvements in hardware reliability:
- 1990 disks: 50,000-hour MTBF, rising to 1,200,000 hours in 2000
- PC motherboards from 100,000 to 1,000,000 hours
- Yet everything has an error rate:
- Well designed and manufactured HW: >1% fail/year
- Well designed and tested SW: >1 bug / 1000 lines
- Well trained people doing routine tasks: 1-2%
- Well run collocation site (e.g., Exodus): 1 power failure per year, 1 network outage per year
31. Lessons learned from Past Projects for AME
- Maintenance of machines (with state) is expensive:
- 5X to 10X cost of HW
- Stateless machines can be trivial to maintain (Hotmail)
- System admin primarily keeps the system available:
- System + clever human working during failure = uptime
- Also: plan for growth, software upgrades, configuration, fixing performance bugs, doing backup
- Software upgrades: necessary, dangerous
- SW bugs fixed, new features added, but stability?
- Admins try to skip upgrades, to be the last to use one
32. Lessons learned from the Internet
- Realities of the Internet service environment:
- hardware and software failures are inevitable
- hardware reliability still imperfect
- software reliability thwarted by rapid evolution
- Internet system scale exposes second-order failure modes
- system failure modes cannot be modeled or predicted
- commodity components do not fail cleanly
- black-box system design thwarts models
- unanticipated failures are normal
- human operators are imperfect
- human error accounts for 50% of all system failures
Sources: Gray86, Hamilton99, Menn99, Murphy95, Perrow99, Pope86
33. Other Fields
- How to minimize error affordances:
- Design for consistency between designer, system, and user models: a good conceptual model
- Simplify the model so it matches human limits: working memory, problem solving
- Make visible what the options are, and what the consequences of actions are
- Exploit natural mappings: between intentions and possible actions, between actual state and what is perceived, ...
- Use constraints (natural, artificial) to guide the user
- Design for errors. Assume their occurrence. Plan for error recovery. Make it easy to reverse actions and hard to perform irreversible ones.
- When all else fails, standardize (ease of use is more important; only standardize as a last resort)
34. Cost of one hour of downtime (I)
- Source: http://www.techweb.com/internetsecurity/doc/95.html (April 2000)
- 65% of surveyed sites reported at least one user-visible outage in the previous 6-month period
- 25% reported > 3 outages
- 3 leading causes:
- Scheduled downtime (35%)
- Service provider outages (22%)
- Server failure (21%)
35. Cost of one hour of downtime (II)
- Brokerage: 6.45M
- Credit card authorization: 2.6M
- Ebay.com: 225K
- Amazon.com: 180K
- Package shipping service: 150K
- Home shopping channel: 119K
- Catalog sales center: 90K
- Airline reservation center: 89K
- Cellular service activation: 41K
- On-line network fees: 25K
- ATM service fees: 14K
- Amounts in USD
- This table ignores the loss due to wasting the time of employees
36. A metric of cost of downtime
- A = % of employees affected by the outage
- B = % of income affected by the outage
- EC = average employee cost per hour
- EI = average income per hour
- Estimated cost of one hour of downtime = (A x EC) + (B x EI)  (sketch below)
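A minimal sketch of this metric in Python, assuming the terms combine as in the formula above (hourly cost = share of employees idled times their hourly cost, plus share of income blocked times hourly income); the numbers in the example call are made up:

    def downtime_cost_per_hour(a, b, ec, ei):
        """Estimated cost of one hour of downtime.

        a  : fraction of employees affected by the outage (0..1)
        b  : fraction of income affected by the outage (0..1)
        ec : average employee cost per hour
        ei : average income per hour
        """
        return a * ec + b * ei

    # Hypothetical example: 80% of staff idled, 50% of revenue blocked.
    print(downtime_cost_per_hour(a=0.8, b=0.5, ec=40_000, ei=400_000))  # 232000.0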
37. High availability (I)
- Used to be a "solved" problem in the TP community:
- Fault-tolerant mainframes (IBM, Tandem)
- Vendor-supplied HA TP system
- Carefully tested & tuned
- Dumb terminals & human agents:
- a firewall for end-users
- Well-designed, stable & controlled environment
Not so for today's Internet: key assumptions of traditional HA design no longer hold.
38. High availability (II)
- TP functionality & data access are directly exposed to customers
- through a complicated & heterogeneous conglomeration of interconnected systems
- databases, app servers, middleware, Web servers
- constructed from a multi-vendor mix of off-the-shelf H/W & S/W
Perceived availability is defined by the weakest link, so it's not enough to have a robust TP back-end.
39. Traditional HA design assumptions
- H/W & S/W components can be built to have negligible (visible) failure rates
- Failure modes can be predicted & tolerated
- Maintenance & repair are error-free procedures
Attempt to maximize MTTF
40. Inevitability of unpredictable failures
- Arms race for new features => less S/W testing!
- Failure-prone H/W:
- e.g., PC motherboards that do not have ECC memory
- Google: 8000-node cluster
- 2-3% node failure rate per year
- 1/3 of failures attributable to DRAM or memory-bus failures
- At least one node failure per week
- Pressure & complexity => higher rate of human error
- Charles Perrow's theory of "normal accidents"
- arising from multiple unexpected interactions of smaller failures and the recovery systems designed to handle them
Cascading failures
41. PSTN vs Internet
- Study of 200 PSTN outages in the U.S.
- that affected > 30K customers
- or lasted > 30 minutes
- H/W: 22%, S/W: 8%
- Overload: 11%
- Operator: 59%
- Study of 3 popular Internet sites:
- H/W: 15%
- S/W: 34%
- Operator: 51%
42. Large-scale Internet services
- Hosted in geographically distributed colocation facilities
- Use mostly commodity H/W, OS & networks
- Multiple levels of redundancy & load balancing
- 3 tiers: load balancing, stateless front-ends, back-end
- Use primarily custom-written S/W
- Undergo frequent S/W & configuration updates
- Operate their own 24x7 operations centers
Expected to be available 24x7 for access by users around the globe
43. Characteristics that can be exploited for HA
- Plentiful H/W => allows for redundancy
- Use of colocation facilities => controlled environmental conditions & resilience to large-scale disasters
- Operators learn more about the internals of the S/W
- so that they can detect & resolve problems
44. Modern HA design assumptions
- Accept the inevitability of unpredictable failures, in H/W, S/W & operators
- Build systems with a mentality of failure recovery & repair, rather than failure avoidance
Attempt to minimize MTTR: Recovery-Oriented Computing
- Redundancy of H/W & data
- Partitionable design for fault containment
- Efficient fault detection
45. User-visible failures
- Operator errors are a primary cause!
- Service front-ends are less robust than back-ends
- Online testing & more thorough detection and exposure of component failures can reduce observed failure rates
- Injection of test cases, including faults & load
- Root-cause analysis (dependency checking)
46. Recovery-Oriented Computing: Hypothesis
- "If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time" (Shimon Peres)
- Failures are a fact, and recovery/repair is how we cope with them
- Improving recovery/repair improves availability:
- Unavailability ≈ MTTR / MTTF
- 1/10th the MTTR is just as valuable as 10X the MTBF (numeric sketch below)
- Since a major sysadmin job is recovery after failure, ROC also helps with maintenance (assuming MTTR << MTTF)
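A small numeric sketch (illustrative numbers only) of why cutting MTTR by 10X buys the same availability as growing MTTF by 10X:

    def availability(mttf_hours, mttr_hours):
        """Steady-state availability = MTTF / (MTTF + MTTR)."""
        return mttf_hours / (mttf_hours + mttr_hours)

    base       = availability(mttf_hours=1000, mttr_hours=1.0)   # ~0.99900
    tenx_mttf  = availability(mttf_hours=10000, mttr_hours=1.0)  # ~0.99990
    tenth_mttr = availability(mttf_hours=1000, mttr_hours=0.1)   # ~0.99990
    print(base, tenx_mttf, tenth_mttr)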
47. Tentative ROC Principles 1: Isolation and Redundancy
- System is partitionable
- to isolate faults
- to enable online repair/recovery
- to enable online HW growth / SW upgrade
- to enable operator training / expanding experience on portions of the real system
- Techniques: geographically replicated sites, shared-nothing clusters, separate address spaces inside a CPU
- System is redundant
- sufficient HW redundancy / data replication => part of the system may be down, but satisfactory service is still available
- enough to survive a 2nd failure during recovery
- Techniques: RAID-6, N copies of data
48. Tentative ROC Principles 2: Online Verification
- System enables input insertion and output checking of all modules, including fault insertion (sketch after this list)
- to check module sanity, to find failures faster
- to test correctness of recovery mechanisms
- insert (random) faults and known-incorrect inputs
- also enables availability benchmarks
- to expose & remove latent errors from each system
- to train the operator / expand operator experience
- periodic reports to management on skills
- to discover whether the warning system is broken
- Techniques: global invariants; topology discovery; program checking (SW ECC)
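A minimal sketch of the input-insertion/output-check idea in Python; the module, the probe inputs, and the expected outputs are hypothetical placeholders, and a real system would report results to its monitoring/alarm path:

    import random
    import time

    # Known input -> expected output pairs for a hypothetical "lookup" module.
    PROBES = [({"key": "health-check-1"}, "ok-1"),
              ({"key": "health-check-2"}, "ok-2")]

    def verify_module(module, inject_fault=False):
        """Feed known inputs through the module and flag wrong outputs.

        If inject_fault is set, also feed a deliberately bad input and check
        that the module rejects it instead of returning garbage.
        """
        for request, expected in PROBES:
            if module(request) != expected:
                return False                  # latent error exposed
        if inject_fault:
            try:
                module({"key": None})         # known-incorrect input
                return False                  # should have raised
            except ValueError:
                pass
        return True

    def watchdog(module, rounds=10, period_s=60):
        """Periodic driver: don't wait for user-visible failure reports."""
        for _ in range(rounds):
            if not verify_module(module, inject_fault=random.random() < 0.1):
                print("module failed online verification; raise alarm")
            time.sleep(period_s)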
49. Tentative ROC Principles 3: Undo Support
- A ROC system should offer Undo (sketch after this list)
- to recover from operator errors
- people detect 3 of 4 errors, so why not undo?
- to recover from inevitable SW errors
- restore the entire system state to the pre-error version
- to simplify maintenance by supporting trial and error
- create a forgiving/reversible environment
- to recover from operator training after fault insertion
- to replace traditional backup and restore
- Techniques: checkpointing; logging; time-travel (log-structured) file systems; virtual machines; GoBack file protection
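A minimal sketch of operator-level undo built on an append-only log of inverse actions; the actions themselves are hypothetical, and a real undo facility would rely on the checkpointing, logging, or VM-snapshot techniques listed above rather than print statements:

    class UndoLog:
        """Record an inverse action for every change, so that a maintenance
        step (or an operator mistake) can be rolled back in reverse order."""

        def __init__(self):
            self._inverses = []

        def record(self, description, inverse_fn):
            self._inverses.append((description, inverse_fn))

        def undo_all(self):
            while self._inverses:
                description, inverse_fn = self._inverses.pop()
                print("undoing:", description)
                inverse_fn()

    # Hypothetical usage during a software upgrade:
    log = UndoLog()
    log.record("installed app v2", lambda: print("reinstall app v1"))
    log.record("migrated config", lambda: print("restore old config"))
    log.undo_all()   # rolls back in reverse order if the upgrade misbehaves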
50. Tentative ROC Principles 4: Diagnosis Support
- System assists the human in diagnosing problems (sketch after this list)
- root-cause analysis to suggest possible failure points
- track resource dependencies of all requests
- correlate symptomatic requests with the component dependency model to isolate culprit components
- health reporting to detect failed/failing components
- failure information and self-test results propagated upwards
- discovery of network and power topology
- don't rely on things being connected according to plans
- Techniques: stamp data blocks with the modules used; log faults, errors, failures, and recovery methods
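A minimal sketch of dependency-based root-cause analysis: each request is stamped with the components it touched, and components whose requests fail most often are ranked as suspects. Component names and the trace are hypothetical:

    from collections import Counter

    # Each request record: (set of components it touched, did it succeed?)
    requests = [
        ({"lb", "fe1", "db1"}, True),
        ({"lb", "fe2", "db1"}, False),
        ({"lb", "fe2", "db2"}, False),
        ({"lb", "fe1", "db2"}, True),
    ]

    failed, total = Counter(), Counter()
    for components, ok in requests:
        for c in components:
            total[c] += 1
            if not ok:
                failed[c] += 1

    # Rank components by the fraction of their requests that failed.
    suspects = sorted(total, key=lambda c: failed[c] / total[c], reverse=True)
    print(suspects)   # 'fe2' is the top suspect in this toy trace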
51. Towards AME via ROC
- New foundation to reduce MTTR
- cope with the fact that people, SW, and HW fail (Peres's Law)
- transactions/snapshots to undo failures and bad repairs
- recovery benchmarks to evaluate MTTR innovations
- interfaces to allow fault insertion and input insertion, to report module errors, and to report module performance
- module I/O error checking and module isolation
- log errors and solutions for root-cause analysis; rank potential solutions to a problem
- Significantly reducing MTTR (HW/SW/LW) => significantly increased availability + significantly improved maintenance costs
52. Availability benchmark methodology
- Goal: quantify variation in QoS metrics as events occur that affect system availability (sketch after this list)
- Leverage existing performance benchmarks
- to generate fair workloads
- to measure & trace quality-of-service metrics
- Use fault injection to compromise the system
- hardware faults (disk, memory, network, power)
- software faults (corrupt input, driver error returns)
- maintenance events (repairs, SW/HW upgrades)
- Examine single-fault and multi-fault workloads
- the availability analogues of performance micro- and macro-benchmarks
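A minimal sketch of such a benchmark harness; the workload driver, fault injector, and QoS probe are hypothetical callables supplied by the benchmarker:

    import time

    def availability_benchmark(run_workload, inject_fault, measure_qos,
                               duration_s=300, fault_at_s=120, sample_s=5):
        """Trace a QoS metric over time while a single fault is injected."""
        trace, start, injected = [], time.time(), False
        while time.time() - start < duration_s:
            run_workload(sample_s)                 # keep load on the system
            elapsed = time.time() - start
            if not injected and elapsed >= fault_at_s:
                inject_fault()                     # e.g. fail a disk or kill a node
                injected = True
            trace.append((elapsed, measure_qos())) # e.g. requests/sec or latency
        return trace                               # plot QoS vs. time: dip + recovery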
53. An Approach to ROC
- 4 parts to Time to Recovery:
- 1) time to detect the error,
- 2) time to pinpoint the error (root-cause analysis),
- 3) time to choose & try several possible solutions to fix the error, and
- 4) time to fix the error
- The result is a set of principles for Recovery-Oriented Computing (ROC)
54. An Approach to ROC
- 1) Time to detect errors
- include interfaces that report faults/errors from components
- may allow the application/system to predict/identify failures; prediction really lowers MTTR
- periodic insertion of test inputs with known results into the system, vs. waiting for failure reports
- reduce time to detect
- better than a simple pulse check
55. An Approach to ROC
- 2) Time to pinpoint the error
- error checking at the edges of each component
- program checking analogy: if the computation is O(n^x) with x > 1 and the check is O(n), checking has little impact (a sketch of a checked sort follows this list)
- e.g., check that the list is sorted before returning from a sort
- design each component to allow isolation, and insert test inputs to see if it performs
- keep a history of failure symptoms/reasons and recent behavior (root-cause analysis)
- stamp each datum with all the modules it touched?
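A tiny example of the program-checking idea: sorting costs O(n log n) or worse, while verifying sortedness is O(n), so the check is nearly free:

    def checked_sort(items):
        """Sort, then verify the O(n) 'is sorted' post-condition before returning."""
        result = sorted(items)
        assert all(result[i] <= result[i + 1] for i in range(len(result) - 1)), \
            "sort post-condition violated: fail fast instead of returning bad data"
        return result

    print(checked_sort([3, 1, 2]))   # [1, 2, 3]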
56. An Approach to ROC
- 3) Time to try possible solutions
- history of errors/solutions
- undo of any repair, to allow trial of possible solutions
- support for snapshots and transactions/logging fundamental in the system
- since disk capacity & bandwidth are the fastest-growing technology, use them to improve repair?
- caching at many levels of the system provides redundancy that may be used for transactions?
- SW errors corrected by undo?
- human errors corrected by undo?
57. An Approach to ROC
- 4) Time to fix the error
- find a failure workload, use repair benchmarks
- competition leads to improved MTTR
- include interfaces that allow repair events to be systematically tested
- predictable fault insertion allows debugging of repair as well as benchmarking MTTR
- since people make mistakes during repair, provide undo for any maintenance event
- replaced the wrong disk in a RAID system on a failure? undo and replace the bad disk without losing info
- recovery oriented => accommodate HW/SW/human errors during repair