1
CS-556 Distributed Systems
Recovery-Oriented Computing
  • Manolis Marazakis
  • maraz@csd.uoc.gr

2
Dependability in the Internet Era (J. Gray, Microsoft Research, 2001)
Recovery-Oriented Computing (D. Patterson, UCB, 2002)
3
The Last 5 Years: Availability Dark Ages. Ready
for a Renaissance?
  • Things got better, then things got a lot worse!

[Chart: availability over time for telephone systems, cell phones, computer systems, and the Internet]
4
DEPENDABILITY: The 3 ITIES
  • RELIABILITY / INTEGRITY: Does the right
    thing. (also MTTF >> 1)
  • AVAILABILITY: Does it now. (also 1 >> MTTR)
    Availability = MTTF / (MTTF + MTTR)
    System Availability: if 90% of terminals are up
    and 99% of the DB is up => 89% of transactions
    are serviced on time (worked example below).
  • Holistic vs. Reductionist view

[Diagram: the related "-ities": Security, Integrity, Reliability, Availability]
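A worked reading of the 89% figure above, assuming terminal and DB availability are independent:
  0.90 × 0.99 = 0.891, i.e. about 89% of transactions can be serviced on time.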
5
Fail-Fast is Good, Repair is Needed
Lifecycle of a module: fail-fast gives short
fault latency. High Availability is
low UN-Availability:
  Unavailability ≈ MTTR / MTTF  (see the sketch below)
  • Improving either MTTR or MTTF gives benefit
  • Simple redundancy does not help much.
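A minimal sketch of this arithmetic in Python (the MTTR/MTTF numbers are illustrative, not from the slides); it shows that a 10x shorter MTTR improves availability as much as a 10x longer MTTF:

    def unavailability(mttr_hours, mttf_hours):
        # exact form; reduces to MTTR/MTTF when MTTR << MTTF
        return mttr_hours / (mttr_hours + mttf_hours)

    base          = unavailability(mttr_hours=4.0, mttf_hours=4000.0)
    faster_repair = unavailability(mttr_hours=0.4, mttf_hours=4000.0)   # 10x shorter MTTR
    rarer_faults  = unavailability(mttr_hours=4.0, mttf_hours=40000.0)  # 10x longer MTTF
    print(base, faster_repair, rarer_faults)  # ~1e-3, ~1e-4, ~1e-4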

6
Fault Model
  • Failures are independent, so single-fault
    tolerance is a big win
  • Hardware fails fast (dead disk, blue-screen)
  • Software fails-fast (or goes to sleep)
  • Software often repaired by reboot
  • Heisenbugs
  • Operations tasks: a major source of outage
  • Utility operations
  • Software upgrades

7
Disks (RAID): the BIG Success Story
  • Duplex or Parity masks faults
  • Disks @ 1M hours MTTF (~100 years)
  • But:
  • controllers fail, and
  • there are 1,000s of disks.
  • Duplexing or parity, and dual path gives perfect
    disks
  • Wal-Mart never lost a byte (thousands of
    disks, hundreds of failures).
  • Only software/operations mistakes are left.

8
Fault Tolerance vs Disaster Tolerance
  • Fault-Tolerance: masks local faults
  • RAID disks
  • Uninterruptible Power Supplies
  • Cluster Failover
  • Disaster Tolerance: masks site failures
  • Protects against fire, flood, sabotage, ...
  • Redundant system and service at remote site.

9
Case Study - Japan: "Survey on Computer Security",
Japan Info Dev Corp., March 1986 (trans. Eiichi Watanabe).
[Pie chart of outage causes: Vendor 42%, Telecomm lines 12%, Environment 11.2%, Application software 25%, Operations 9.3%]
  • MTTF by cause -- Vendor (hardware and software):
    5 months
  • Application software: 9 months
  • Communications lines: 1.5 years
  • Operations: 2 years
  • Environment: 2 years
  • Overall MTTF: 10 weeks
  • 1,383 institutions reported (6/84 - 7/85)
  • 7,517 outages, MTTF ~10 weeks, avg
    duration ~90 MINUTES (rough check below)
  • To Get 10 Year MTTF, Must Attack All These Areas
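A rough consistency check of the reported MTTF, assuming roughly 56 weeks of reporting (6/84 - 7/85):
  1,383 institutions × 56 weeks ≈ 77,400 institution-weeks
  77,400 institution-weeks / 7,517 outages ≈ 10.3 weeks between outages per institution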

10
Case Studies - Tandem Trends
  • MTTF improved
  • Shift from Hardware & Maintenance (from 50% to
    10%)
  • to Software (62%) & Operations (15%)
  • NOTE: Systematic under-reporting of Environment,
  • Operations errors, and
  • Application Software

11
Dependability Status circa 1995
  • 4-year MTTF => 5 9's for well-managed systems.
    Fault Tolerance Works.
  • Hardware is GREAT (maintenance and MTTF).
  • Software masks most hardware faults.
  • Many hidden software outages in operations
  • New Software.
  • Utilities.
  • Make all hardware/software changes ONLINE.
  • Software seems to define a 30-year MTTF ceiling.
  • Reasonable Goal: 100-year MTTF.
    Class 4 today => class 6 tomorrow.

12
What's Happened Since Then?
  • Hardware got better
  • Software got better (even though it is more
    complex)
  • RAID is standard, Snapshots coming standard
  • Cluster-in-a-box: commodity failover
  • Remote replication is standard.

13
[Diagram: availability ladder]
  • Un-managed
  • Well-managed nodes: masks some hardware failures
  • Well-managed packs & clones: masks hardware
    failures and operations tasks (e.g. software
    upgrades); masks some software failures
  • Well-managed GeoPlex: masks site failures (power,
    network, fire, move, ...); masks some operations
    failures
14
Progress?
  • MTTF improved from 1950-1995
  • MTTR has not improved much since 1970 (failover)
  • Hardware and Software online change (pNp) is now
    standard
  • Then the Internet arrived
  • No project can take more than 3 months.
  • Time to market is everything
  • Change is good.

15
The Internet Changed Expectations
  • 1990
  • Phones delivered 99.999%
  • ATMs delivered 99.99%
  • Failures were front-page news.
  • Few hackers
  • Outages last an hour
  • 2000
  • Cellphones deliver 90%
  • Web sites deliver 98%
  • Failures are business-page news
  • Many hackers.
  • Outages last a day

This is progress?
16
Why (1): Complexity
  • Internet sites are MUCH more complex.
  • NAP
  • Firewall / proxy / IP sprayer
  • Web
  • DMZ
  • App server
  • DB server
  • Links to other sites
  • tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os
  • Skill level is much reduced

17
A Schematic of HotMail
  • 7,000 servers
  • 100 backend stores with 120TB (cooked)
  • 3 data centers
  • Links to
  • Passport
  • Ad-rotator
  • Internet Mail gateways
  • 1B messages per day
  • 150M mailboxes, 100M active
  • 400,000 new per day.

18
Why (2): Velocity
  • No project can take more than 13 weeks.
  • Time to market is everything
  • Functionality is everything
  • Faster, cheaper, badder ?

19
Why (3): Hackers
  • Hackers are a new & increased threat
  • Any site can be attacked from anywhere
  • Motives include ego, malice, and greed.
  • Complexity makes it hard to protect sites.
  • Concentration of wealth makes an attractive target
  • "Why did you rob banks?"
  • Willie Sutton: "Cause that's where the money is!"

Note: Eric Raymond's "How to Become a Hacker"
(http://www.tuxedo.org/~esr/faqs/hacker-howto.html)
is the positive use of the term; here I mean
malicious and anti-social hackers.
20
How Bad Is It?
http://www-iepm.slac.stanford.edu/pinger/
  • Median monthly ping packet loss for 2/99

21
Microsoft.Com
  • Operations mis-configured a router
  • Took a day to diagnose and repair.
  • DoS attacks cost a fraction of a day.
  • Regular security patches.

22
BackEnd Servers are More Stable
  • Generally deliver 99.99%
  • TerraServer, for example: a single back-end
    failed after 2.5 years.
  • Went to a 4-node cluster
  • Fails every 2 months. Transparent failover in 30
    sec. Online software upgrades. So 99.999% in the
    backend.

23
eBay: A very honest site
http://www2.ebay.com/aw/announce.shtml#top
  • Publishes operations log.
  • Has 99% of scheduled uptime
  • Schedules about 2 hours/week down.
  • Has had some operations outages
  • Has had some DoS problems.

24
Not to throw stones, but ...
  • Everyone has a serious problem.
  • The BEST people publish their stats.
  • The others HIDE their stats (check Netcraft
    to see who I mean).
  • We have good NODE-level availability: 5-9's is
    reasonable.
  • We have TERRIBLE system-level availability: 2-9's
    is the goal.

25
Recommendation 1
  • Continue progress on back-ends.
  • Make management easier (AUTOMATE IT!!!)
  • Measure
  • Compare best practices
  • Continue to look for better algorithms.
  • Live in fear
  • We are at 10,000 node servers
  • We are headed for 1,000,000 node servers

26
Recommendation 2
  • Current security approach is unworkable
  • Anonymous clients
  • Firewall is clueless
  • Incredible complexity
  • We can't win this game!
  • So change the rules (redefine the problem)
  • No anonymity
  • Unified authentication/authorization model
  • Single-function devices (with simple interfaces)
  • Only one kind of interface (UDDI/WSDL/SOAP/...).

27
References
  • Adams, E. (1984). Optimizing Preventive Service of Software Products. IBM Journal of Research and Development, 28(1): 2-14.
  • Anderson, T. and B. Randell (1979). Computing Systems Reliability.
  • Garcia-Molina, H. and C. A. Polyzois (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90: 573-577.
  • Gray, J. (1986). Why Do Computers Stop and What Can We Do About It? 5th Symposium on Reliability in Distributed Software and Database Systems: 3-12.
  • Gray, J. (1990). A Census of Tandem System Availability between 1985 and 1990. IEEE Transactions on Reliability, 39(4): 409-418.
  • Gray, J. N. and A. Reuter (1993). Transaction Processing: Concepts and Techniques. San Mateo, Morgan Kaufmann.
  • Lampson, B. W. (1981). Atomic Transactions. In Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
  • Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS: 2-11.
  • Long, D. D., J. L. Carroll, and C. J. Park (1991). A Study of the Reliability of Internet Sites. Proc. 10th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.
  • Long, D., A. Muir, and R. Golding (1995). "A Longitudinal Study of Internet Host Reliability." Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, IEEE, September 1995, pp. 2-9.
  • http://www.netcraft.com/ -- They have even better for-fee data as well, but the for-free data is really excellent.
  • http://www2.ebay.com/aw/announce.shtml#top -- eBay is an excellent benchmark of best Internet practices.
  • http://www-iepm.slac.stanford.edu/pinger/ -- Network traffic/quality report; dated, but the others have died off!

28
The real scalability problems: AME
  • Availability
  • systems should continue to meet quality of
    service goals despite hardware and software
    failures
  • Maintainability
  • systems should require only minimal ongoing human
    administration, regardless of scale or
    complexity. Today, cost of maintenance is 10X the
    cost of purchase.
  • Evolutionary Growth
  • systems should evolve gracefully in terms of
    performance, maintainability, and availability as
    they are grown/upgraded/expanded
  • These are problems at today's scales, and will
    only get worse as systems grow

29
Total Cost of Ownership (IBM)
  • Administration: all people time
  • Backup & Restore: devices, media, and people time
  • Environmental: floor space, power, air
    conditioning

30
Lessons learned from Past Projects which might
help AME
  • Know how to improve performance (and cost)
  • Run system against workload, measure, innovate,
    repeat
  • Benchmarks standardize workloads, lead to
    competition, evaluate alternatives; they turn
    debates into numbers
  • Major improvements in Hardware Reliability
  • Disks: 50,000-hour MTBF in 1990 to 1,200,000 hours
    in 2000
  • PC motherboards: from 100,000 to 1,000,000 hours
  • Yet everything has an error rate:
  • Well designed and manufactured HW: >1% fail/year
  • Well designed and tested SW: > 1 bug / 1000 lines
  • Well trained people doing routine tasks: 1%-2%
  • Well run collocation site (e.g., Exodus): 1
    power failure per year, 1 network outage per year

31
Lessons learned from Past Projects for AME
  • Maintenance of machines (with state) is expensive:
  • 5X to 10X the cost of the HW
  • Stateless machines can be trivial to maintain
    (Hotmail)
  • System admin primarily keeps system available
  • System + clever human working during failure =
    uptime
  • Also plan for growth, software upgrades,
    configuration, fix performance bugs, do backup
  • Software upgrades: necessary, dangerous
  • SW bugs fixed, new features added, but stability?
  • Admins try to skip upgrades, be the last to use
    one

32
Lessons learned from Internet
  • Realities of Internet service environment
  • hardware and software failures are inevitable
  • hardware reliability still imperfect
  • software reliability thwarted by rapid evolution
  • Internet system scale exposes second-order
    failure modes
  • system failure modes cannot be modeled or
    predicted
  • commodity components do not fail cleanly
  • black-box system design thwarts models
  • unanticipated failures are normal
  • human operators are imperfect
  • human error accounts for 50% of all system
    failures

Sources: Gray86, Hamilton99, Menn99, Murphy95,
Perrow99, Pope86
33
Other Fields
  • How to minimize error affordances
  • Design for consistency between designer, system,
    and user models: a good conceptual model
  • Simplify the model so it matches human limits:
    working memory, problem solving
  • Make visible what the options are, and what the
    consequences of actions are
  • Exploit natural mappings between intentions and
    possible actions, actual state and what is
    perceived, ...
  • Use constraints (natural, artificial) to guide the
    user
  • Design for errors. Assume their occurrence. Plan
    for error recovery. Make it easy to reverse
    actions and hard to perform irreversible ones.
  • When all else fails, standardize (ease of use is
    more important; only standardize as a last resort)

34
Cost of one hour of downtime (I)
  • Source: http://www.techweb.com/internetsecurity/doc/95.html
  • April 2000
  • 65% of surveyed sites reported at least one
    user-visible outage in the previous 6-month
    period
  • 25% reported > 3 outages
  • 3 leading causes:
  • Scheduled downtime (35%)
  • Service provider outages (22%)
  • Server failure (21%)

35
Cost of one hour of downtime (II)
  • Brokerage: 6.45M
  • Credit card authorization: 2.6M
  • Ebay.com: 225K
  • Amazon.com: 180K
  • Package shipping service: 150K
  • Home shopping channel: 119K
  • Catalog sales center: 90K
  • Airline reservation center: 89K
  • Cellular service activation: 41K
  • On-line network fees: 25K
  • ATM service fees: 14K
  • Amounts in USD
  • This table ignores the loss due to wasting the
    time of employees

36
A metric of cost of downtime
  • A = fraction of employees affected by the outage
  • B = fraction of income affected by the outage
  • EC = average employee cost per hour
  • EI = average income per hour
  • Estimated cost of one hour of downtime =
    EC × A + EI × B (see the sketch below)
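A minimal Python sketch of this metric; the example figures are purely illustrative, not from the slides:

    def downtime_cost_per_hour(a_employees_affected, b_income_affected,
                               ec_employee_cost_per_hour, ei_income_per_hour):
        # cost/hour = EC * A + EI * B
        return (ec_employee_cost_per_hour * a_employees_affected
                + ei_income_per_hour * b_income_affected)

    # e.g. an outage that idles half the staff and blocks all on-line revenue:
    print(downtime_cost_per_hour(0.5, 1.0, 50_000, 180_000))   # -> 205000.0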

37
High availability (I)
  • Used to be a solved problem in the TP
    community
  • Fault-tolerant mainframes (IBM, Tandem)
  • Vendor-supplied HA TP system
  • Carefully tested & tuned
  • Dumb terminals & human agents
  • = a firewall for end-users
  • Well-designed, stable & controlled environment

Not so for today's Internet
Key assumptions of traditional HA design no
longer hold
38
High availability (II)
  • TP functionality & data access are directly
    exposed to customers
  • through a complicated, heterogeneous
    conglomeration of interconnected systems
  • Databases, app servers, middleware, Web servers
  • constructed from a multi-vendor mix of
    off-the-shelf H/W & S/W

Perceived availability is defined by the weakest
link
so it's not enough to have a robust TP back-end
39
Traditional HA design assumptions
  • H/W & S/W components can be built to have
    negligible (visible) failure rates
  • Failure modes can be predicted & tolerated
  • Maintenance & repair are error-free procedures

Attempt to maximize MTTF
40
Inevitability of unpredictable failures
  • Arms race for new features → less S/W testing!
  • Failure-prone H/W
  • E.g., PC motherboards that do not have ECC memory
  • Google: 8000-node cluster
  • 2-3% node failure rate per year
  • 1/3 of failures attributable to DRAM or memory
    bus failures
  • At least one node failure per week
  • Pressure & complexity → higher rate of human error
  • Charles Perrow's theory of normal accidents
  • arising from multiple unexpected interactions
    of smaller failures and the recovery systems
    designed to handle them

Cascading failures
41
PSTN vs Internet
  • Study of 200 PSTN outages in the U.S.
  • that affected > 30K customers
  • or lasted > 30 minutes
  • H/W → 22%, S/W → 8%
  • Overload → 11%
  • Operator → 59%
  • Study of 3 popular Internet sites
  • H/W → 15%
  • S/W → 34%
  • Operator → 51%

42
Large-scale Internet services
  • Hosted in geographically distributed colocation
    facilities
  • Use mostly commodity H/W, OS & networks
  • Multiple levels of redundancy & load balancing
  • 3 tiers: load balancing, stateless front-ends,
    back-end
  • Use primarily custom-written S/W
  • Undergo frequent S/W & configuration updates
  • Operate their own 24x7 operation centers

Expected to be available 24x7 for access by users
around the globe
43
Characteristics that can be exploited for HA
  • Plentiful H/W → allows for redundancy
  • Use of collocation facilities → controlled
    environmental conditions & resilience to
    large-scale disasters
  • Operators learn more about the internals of the S/W
  • so that they can detect & resolve problems

44
Modern HA design assumptions
  • Accept the inevitability of unpredictable
    failures, in H/W, S/W & operators
  • Build systems with a mentality of failure
    recovery & repair, rather than failure avoidance

Attempt to minimize MTTR
Recovery-oriented Computing
  • Redundancy of H/W & data
  • Partitionable design for fault containment
  • Efficient fault detection

45
User-visible failures
  • Operator errors are a primary cause !
  • Service front-ends (FEs) are less robust than
    back-ends
  • Online testing and more thorough detection and
    exposure of component failures can reduce observed
    failure rates
  • Injection of test cases, including faults & load
  • Root-cause analysis (dependency checking)

46
Recovery-Oriented Computing Hypothesis
  • "If a problem has no solution, it may not be a
    problem, but a fact, not to be solved, but to be
    coped with over time"
  • -- Shimon Peres
  • Failures are a fact, and recovery/repair is how
    we cope with them
  • Improving recovery/repair improves availability
  • UnAvailability ≈ MTTR / MTTF
  • 1/10th MTTR is just as valuable as 10X MTBF
    (see the worked comparison below)
  • Since major Sys Admin job is recovery after
    failure, ROC also helps with maintenance

(assuming MTTR << MTTF)
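Why a 10x shorter MTTR and a 10x longer MTTF are equally valuable under this approximation:
  UnAvailability ≈ MTTR / MTTF
  (MTTR / 10) / MTTF = MTTR / (10 × MTTF)
Either change cuts unavailability by the same factor of 10.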
47
Tentative ROC Principles #1: Isolation and
Redundancy
  • System is Partitionable
  • To isolate faults
  • To enable online repair/recovery
  • To enable online HW growth/SW upgrade
  • To enable operator training/expand experience on
    portions of real system
  • Techniques: Geographically replicated sites,
    Shared-nothing cluster, Separate address spaces
    inside the CPU
  • System is Redundant
  • Sufficient HW redundancy / data replication =>
    part of the system may be down but satisfactory
    service is still available
  • Enough to survive a 2nd failure during recovery
  • Techniques: RAID-6, N copies of data

48
Tentative ROC Principles #2: Online verification
  • System enables input insertion and output checking
    of all modules (including fault insertion)
  • To check module sanity & to find failures faster
  • To test corrections of recovery mechanisms
  • insert (random) faults and known-incorrect
    inputs
  • also enables availability benchmarks
  • To expose & remove latent errors from each system
  • To train operators / expand operator experience
  • Periodic reports to management on skills
  • To discover if the warning system is broken
  • Techniques: Global invariants; Topology
    discovery; Program Checking (SW ECC)
    (sketch below)
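A minimal sketch of input insertion with output checking against known results, plus a deliberately injected fault to confirm the checker itself notices; the module (KVStore) and probe names are hypothetical, not from the slides:

    import uuid

    class KVStore:
        # stand-in for a real module under test
        def __init__(self):
            self._d = {}
        def put(self, key, value):
            self._d[key] = value
        def get(self, key):
            return self._d.get(key)

    def probe(store, n=100, inject_fault=False):
        # input insertion: write known key/value pairs, then check the outputs
        failures = 0
        for i in range(n):
            key, expected = f"probe-{uuid.uuid4()}", str(i)
            store.put(key, expected)
            observed = "corrupted" if inject_fault else store.get(key)  # optional fault insertion
            if observed != expected:
                failures += 1
        return failures

    store = KVStore()
    print(probe(store))                     # 0 expected: module looks sane
    print(probe(store, inject_fault=True))  # 100 expected: proves the checker notices faults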

49
Tentative ROC Principles #3: Undo support
  • ROC system should offer Undo
  • To recover from operator errors
  • People detect 3 of 4 errors, so why not undo?
  • To recover from inevitable SW errors
  • Restore entire system state to pre-error version
  • To simplify maintenance by supporting trial and
    error
  • Create a forgiving/reversible environment
  • To recover from operator training after fault
    insertion
  • To replace traditional backup and restore
  • Techniques: Checkpointing; Logging; time-travel
    (log-structured) file system; Virtual machines;
    GoBack file protection (sketch below)
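A minimal sketch of checkpoint-based undo over a toy in-memory system state; a real ROC undo would snapshot disks, databases or whole VMs rather than a Python dict:

    import copy

    class UndoableSystem:
        def __init__(self, state):
            self.state = state
            self._snapshots = []           # checkpoint log, oldest first

        def checkpoint(self):
            # take a snapshot before a risky operation (upgrade, config change, ...)
            self._snapshots.append(copy.deepcopy(self.state))

        def undo(self):
            # restore the entire system state to the pre-error version
            if self._snapshots:
                self.state = self._snapshots.pop()

    system = UndoableSystem({"version": "1.0", "config": {"threads": 8}})
    system.checkpoint()
    system.state["version"] = "2.0-beta"   # a risky software upgrade
    system.undo()                          # the upgrade misbehaves: roll it back
    print(system.state["version"])         # -> 1.0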

50
Tentative ROC Principles #4: Diagnosis Support
  • System assists human in diagnosing problems
  • Root-cause analysis to suggest possible failure
    points
  • Track resource dependencies of all requests
  • Correlate symptomatic requests with component
    dependency model to isolate culprit components
  • health reporting to detect failed/failing
    components
  • Failure information, self-test results propagated
    upwards
  • Discovery of network & power topology
  • Don't rely on things being connected according to
    plan
  • Techniques: Stamp data blocks with the modules
    used; Log faults, errors, failures and recovery
    methods (sketch below)
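A minimal sketch of correlating symptomatic requests with the components they touched to rank likely culprits; the component names and requests are purely illustrative:

    from collections import Counter

    # each request is stamped with the modules it passed through, plus its outcome
    requests = [
        ({"lb1", "fe2", "db1"}, True),
        ({"lb1", "fe1", "db1"}, False),
        ({"lb1", "fe1", "db2"}, False),
        ({"lb1", "fe2", "db2"}, True),
    ]

    failed, seen = Counter(), Counter()
    for components, ok in requests:
        for c in components:
            seen[c] += 1
            if not ok:
                failed[c] += 1

    # rank components by the fraction of requests through them that failed
    ranking = sorted(seen, key=lambda c: failed[c] / seen[c], reverse=True)
    print(ranking)   # 'fe1' ranks first: every request that touched it failed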

51
Towards AME via ROC
  • New foundation to reduce MTTR
  • Cope with the fact that people, SW and HW fail
    (Peres's Law)
  • Transactions/snapshots to undo failures, bad
    repairs
  • Recovery benchmarks to evaluate MTTR innovations
  • Interfaces to allow fault insertion, input
    insertion, report module errors, report module
    performance
  • Module I/O error checking and module isolation
  • Log errors and solutions for root cause analysis;
    give a ranking to potential solutions to the
    problem
  • Significantly reducing MTTR (HW/SW/LW) =>
    significantly increased availability +
    significantly improved maintenance costs

52
Availability benchmark methodology
  • Goal: quantify variation in QoS metrics as events
    occur that affect system availability (a sketch of
    the loop follows this slide)
  • Leverage existing performance benchmarks
  • to generate fair workloads
  • to measure & trace quality-of-service metrics
  • Use fault injection to compromise system
  • hardware faults (disk, memory, network, power)
  • software faults (corrupt input, driver error
    returns)
  • maintenance events (repairs, SW/HW upgrades)
  • Examine single-fault and multi-fault workloads
  • the availability analogues of performance micro-
    and macro-benchmarks
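A minimal sketch of the single-fault benchmark loop, with a simulated workload standing in for a real load generator and fault injector (all timings and QoS values are illustrative):

    def availability_benchmark(duration_s=60, fault_at_s=20, repair_at_s=35, step_s=1):
        # drive a steady workload, inject one fault, and trace QoS (success rate) over time
        trace = []
        t = 0
        while t < duration_s:
            degraded = fault_at_s <= t < repair_at_s      # window where the injected fault is active
            qos = 0.55 if degraded else 0.999             # simulated per-interval success rate
            trace.append((t, qos))
            t += step_s
        return trace

    trace = availability_benchmark()
    print(f"mean QoS over the run: {sum(q for _, q in trace) / len(trace):.3f}")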

53
An Approach to ROC
  • 4 Parts to Time to Recovery
  • 1) Time to detect error,
  • 2) Time to pinpoint error (root cause
    analysis),
  • 3) Time to choose & try several possible solutions
    to fix the error, and
  • 4) Time to fix error
  • Result is Principles of Recovery-Oriented
    Computing (ROC)

54
An Approach to ROC
  • 1) Time to Detect errors
  • Include interfaces that report faults/errors from
    components
  • May allow application/system to predict/identify
    failures; prediction really lowers MTTR
  • Periodic insertion of test inputs into system
    with known results vs. wait for failure reports
  • Reduce time to detect
  • Better than simple pulse check

55
An Approach to ROC
  • 2) Time to Pinpoint error
  • Error checking at edges of each component
  • Program checking analogy: if the computation is
    O(n^x) (x > 1) and the check is O(n), there is
    little impact from checking
  • E.g., check if the list is sorted before returning
    from a sort (concrete example below)
  • Design each component to allow isolation and
    insertion of test inputs to see if it performs
  • Keep history of failure symptoms/reasons and
    recent behavior (root cause analysis)
  • Stamp each datum with all the modules it touched?
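The sorted-list check above as a concrete Python example: the check is O(n) while the sort is O(n log n), so it adds little overhead:

    def checked_sort(xs):
        out = sorted(xs)
        # O(n) program check at the module's edge before returning the result
        assert all(out[i] <= out[i + 1] for i in range(len(out) - 1)), "sort invariant violated"
        return out

    print(checked_sort([3, 1, 2]))   # -> [1, 2, 3]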

56
An Approach to ROC
  • 3) Time to try possible solutions
  • History of errors/solutions
  • Undo of any repair to allow trial of possible
    solutions
  • Support for snapshots, transactions/logging is
    fundamental in the system
  • Since disk capacity & bandwidth are the fastest
    growing technologies, use them to improve repair?
  • Caching at many levels of systems provides
    redundancy that may be used for transactions?
  • SW errors corrected by undo?
  • Human Errors corrected by undo?

57
An Approach to ROC
  • 4) Time to fix error
  • Find failure workload, use repair benchmarks
  • Competition leads to improved MTTR
  • Include interfaces that allow Repair events to be
    systematically tested
  • Predictable fault insertion allows debugging of
    repair as well as benchmarking MTTR
  • Since people make mistakes during repair, provide
    undo for any maintenance event
  • Replaced the wrong disk in a RAID system on a
    failure? Undo, and replace the bad disk without
    losing info
  • Recovery oriented => accommodate HW/SW/human
    errors during repair