1
Dependability in the Internet Era
2
Outline
  • The glorious past (Availability Progress)
  • The dark ages (current scene)
  • Some recommendations

3
Preview: The Last 5 Years Were the Availability Dark Ages
Ready for a Renaissance?
  • Things got better, then things got a lot worse!

[Chart: Availability (from two 9s up to 99.999%) versus year, 1950-2000, for telephone systems, computer systems, cell phones, and the Internet. Telephone and computer systems climbed toward five 9s over the decades; cell phones and Internet services sit far below.]
4
DEPENDABILITY: The 3 ITIES
  • RELIABILITY / INTEGRITY: Does the right
    thing. (also MTTF >> 1)
  • AVAILABILITY: Does it now. (also 1 >> MTTR)
    Availability = MTTF / (MTTF + MTTR).
    System Availability: if 90% of terminals are up
    and 99% of the DB is up, then about 89% of
    transactions are serviced on time (arithmetic
    sketch at the end of this slide).
  • Holistic vs. Reductionist view

[Diagram: the facets of dependability - Security, Integrity, Reliability, Availability]
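A minimal arithmetic sketch of these definitions, in Python; the 5,000-hour MTTF and 2-hour MTTR below are illustrative assumptions, and only the 90% x 99% example comes from the slide:

    # Availability = MTTF / (MTTF + MTTR)
    def availability(mttf_hours, mttr_hours):
        return mttf_hours / (mttf_hours + mttr_hours)

    print(availability(mttf_hours=5000, mttr_hours=2))   # ~0.9996 for one node

    # Serial composition: every part must be up, so availabilities multiply.
    terminals, database = 0.90, 0.99
    print(f"{terminals * database:.3f}")   # 0.891 -> ~89% of transactions
                                           # serviced on time (90% x 99%)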
5
Fail-Fast is Good, Repair is Needed
Lifecycle of a module: fail-fast gives short
fault latency. High Availability is
low UN-Availability:
Unavailability ~= MTTR / MTTF
  • Improving either MTTR or MTTF gives benefit
    (sketch below).
  • Simple redundancy does not help much.
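A small sketch of that relationship, in Python; the 1,000-hour MTTF and 1-hour MTTR are assumed only for illustration:

    # Unavailability ~= MTTR / MTTF when repair is much faster than failure.
    def unavailability(mttf_hours, mttr_hours):
        return mttr_hours / mttf_hours

    print(unavailability(1000, 1))     # 0.001  (baseline: three 9s)
    print(unavailability(2000, 1))     # 0.0005 -- doubling MTTF halves it
    print(unavailability(1000, 0.5))   # 0.0005 -- halving MTTR does the same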

6
Fault Model
  • Failures are independent, so single-fault
    tolerance is a big win (sketch at the end of this slide)
  • Hardware fails fast (dead disk, blue-screen)
  • Software fails-fast (or goes to sleep)
  • Software often repaired by reboot
  • Heisenbugs
  • Operations tasks: a major source of outage
  • Utility operations
  • Software upgrades
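A sketch of why independence makes single-fault tolerance such a big win, in Python; the 1% per-node unavailability is an assumed figure, not one from the talk:

    # With independent failures and per-node unavailability u, a mirrored
    # pair is down only when both halves are down at the same time: u**2.
    u = 0.01
    print(f"one node down:      {u:.2%}")      # 1.00%
    print(f"mirrored pair down: {u**2:.4%}")   # 0.0100% -- two extra 9s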

7
Disks (RAID): the BIG Success Story
  • Duplex or parity masks faults
  • Disks @ 1M hours MTTF (100 years)
  • But
  • controllers fail, and
  • sites have 1,000s of disks
    (failure arithmetic at the end of this slide).
  • Duplexing or parity, plus dual pathing, gives
    perfect disks
  • Wal-Mart never lost a byte (thousands of
    disks, hundreds of failures).
  • Only software/operations mistakes are left.
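A back-of-envelope check on that arithmetic, in Python; the 10,000-disk farm is a hypothetical size, while the 1M-hour MTTF is the slide's figure:

    HOURS_PER_YEAR = 8766           # 365.25 days
    disk_mttf_hours = 1_000_000     # roughly a century per disk
    disks = 10_000                  # hypothetical farm size
    # Expected disk failures per year across the farm:
    print(disks * HOURS_PER_YEAR / disk_mttf_hours)   # ~88/year, i.e. hundreds
    # of failures over a few years, all masked by duplex/parity plus dual paths.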

8
Fault Tolerance vs Disaster Tolerance
  • Fault-Tolerance mask local faults
  • RAID disks
  • Uninterruptible Power Supplies
  • Cluster Failover
  • Disaster Tolerance masks site failures
  • Protects against fire, flood, sabotage,..
  • Redundant system and service at remote site.

9
Case Study - Japan"Survey on Computer Security",
Japan Info Dev Corp., March 1986. (trans Eiichi
Watanabe).
Vendor
4
2

Tele Comm lines
1
2

1
1
.
2
Environment

2
5

Application Software
9
.
3

Operations
  • Vendor (hardware and software) 5 Months
  • Application software 9 Months
  • Communications lines 1.5 Years
  • Operations 2 Years
  • Environment 2 Years
  • 10 Weeks
  • 1,383 institutions reported (6/84 - 7/85)
  • 7,517 outages, MTTF 10 weeks, avg
    duration 90 MINUTES
  • To Get 10 Year MTTF, Must Attack All These Areas
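A quick consistency check of those survey numbers, in Python; the ~14-month reporting window (6/84 through 7/85) is an assumption about how the dates should be read:

    outages, institutions = 7_517, 1_383
    window_weeks = 14 * 365.25 / 12 / 7            # ~61 weeks
    mttf_weeks = window_weeks / (outages / institutions)
    print(f"{mttf_weeks:.1f} weeks")               # ~11 weeks, in line with the
                                                   # quoted 10-week MTTF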

10
Case Studies - Tandem Trends
  • MTTF improved
  • Shift from Hardware & Maintenance (from 50%
    of outages down to 10%)
    to Software (62%) and Operations (15%)
  • NOTE: Systematic under-reporting of Environment,
    Operations errors, and
    Application Software

11
Dependability Status circa 1995
  • 4-year MTTF => 5 9s for well-managed systems.
    Fault Tolerance Works.
  • Hardware is GREAT (maintenance and MTTF).
  • Software masks most hardware faults.
  • Many hidden software outages in operations:
  • New software.
  • Utilities.
  • Make all hardware/software changes ONLINE.
  • Software seems to define a 30-year MTTF ceiling.
  • Reasonable Goal: 100-year MTTF.
    class 4 today => class 6 tomorrow.

12
What's Happened Since Then?
  • Hardware got better
  • Software got better (even though it is more
    complex)
  • RAID is standard, snapshots becoming standard
  • Cluster-in-a-box commodity failover
  • Remote replication is standard.

13
Availability
[Chart: the number of 9s of availability achieved
at each tier of management]
  • Un-managed
  • Well-managed nodes:
    masks some hardware failures
  • Well-managed packs & clones:
    masks hardware failures and operations tasks
    (e.g. software upgrades); masks some software
    failures
  • Well-managed GeoPlex:
    masks site failures (power, network, fire,
    move, ...); masks some operations failures
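An illustrative sketch of how each tier could add 9s, in Python, assuming fully independent failures and a 99%-available well-managed node; real faults are correlated, so these are optimistic upper bounds rather than figures from the talk:

    node = 0.99                      # assumed two-9s well-managed node
    pack = 1 - (1 - node) ** 2       # two clones in a pack: ~99.99%
    geoplex = 1 - (1 - pack) ** 2    # two packs at separate sites (ideal case)
    print(f"node {node:.2%}  pack {pack:.4%}  geoplex {geoplex:.8%}")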
14
Outline
  • The glorious past (Availability Progress)
  • The dark ages (current scene)
  • Some recommendations

15
Progress?
  • MTTF improved from 1950-1995
  • MTTR has not improved much since 1970 (failover)
  • Hardware and software online change (PnP) is now
    standard
  • Then the Internet arrived
  • No project can take more than 3 months.
  • Time to market is everything
  • Change is good.

16
The Internet Changed Expectations
  • 1990
  • Phones delivered 99.999%
  • ATMs delivered 99.99%
  • Failures were front-page news.
  • Few hackers
  • Outages last an hour
  • 2000
  • Cellphones deliver 90%
  • Web sites deliver 98%
  • Failures are business-page news
  • Many hackers.
  • Outages last a day

This is progress?
17
Why (1): Complexity
  • Internet sites are MUCH more complex.
  • NAP
  • Firewall/proxy/IP sprayer
  • Web server
  • DMZ
  • App server
  • DB server
  • Links to other sites
  • tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os
  • Skill level is much reduced

18
One of the Data Centers (500 servers)
19
A Schematic of HotMail
  • 7,000 servers
  • 100 backend stores with 120TB (cooked)
  • 3 data centers
  • Links to
  • Passport
  • Ad-rotator
  • Internet Mail gateways
  • 1B messages per day
  • 150M mailboxes, 100M active
  • 400,000 new per day.

20
Why (2): Velocity
  • No project can take more than 13 weeks.
  • Time to market is everything
  • Functionality is everything
  • Faster, cheaper, badder?

21
Why (3): Hackers
  • Hackers are a new, increased threat
  • Any site can be attacked from anywhere
  • Motives include ego, malice, and greed.
  • Complexity makes it hard to protect sites.
  • Concentration of wealth makes an attractive target
  • "Why did you rob banks?"
  • Willie Sutton: "'Cause that's where the money is!"

Note: Eric Raymond's "How to Become a Hacker"
(http://www.tuxedo.org/~esr/faqs/hacker-howto.html)
is the positive use of the term; here I mean
malicious and anti-social hackers.
22
How Bad Is It?
http://www-iepm.slac.stanford.edu/
Connectivity is poor.
23
How Bad Is It?
http://www-iepm.slac.stanford.edu/pinger/
  • Median monthly ping packet loss for 2/99

24
Microsoft.Com
  • Operations mis-configured a router
  • Took a day to diagnose and repair.
  • DoS attacks cost a fraction of a day.
  • Regular security patches.

25
BackEnd Servers are More Stable
  • Generally deliver 99.99%
  • TerraServer, for example: its single back-end
    failed after 2.5 years.
  • Went to a 4-node cluster:
  • Fails every 2 months. Transparent failover in 30
    sec. Online software upgrades. So 99.999% in the
    backend (back-of-envelope below).
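A back-of-envelope on those cluster figures, in Python; the 30.44-day average month is the only assumption added to the slide's numbers:

    seconds_per_two_months = 2 * 30.44 * 24 * 3600
    downtime_seconds = 30                       # one transparent failover
    print(f"{1 - downtime_seconds / seconds_per_two_months:.6f}")
    # ~0.999994: about five 9s at the back end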

26
eBay: A very honest site
http://www2.ebay.com/aw/announce.shtml#top
  • Publishes an operations log.
  • Has 99% of scheduled uptime
  • Schedules about 2 hours/week down.
  • Has had some operations outages
  • Has had some DoS problems.

27
Outline
  • The glorious past (Availability Progress)
  • The dark ages (current scene)
  • Some recommendations

28
Not to throw stones, but...
  • Everyone has a serious problem.
  • The BEST people publish their stats.
  • The others HIDE their stats (check Netcraft
    to see who I mean).
  • We have good NODE-level availability: 5 9s is
    reasonable.
  • We have TERRIBLE system-level availability: 2 9s
    is the goal.

29
Recommendation 1
  • Continue progress on back-ends.
  • Make management easier (AUTOMATE IT!!!)
  • Measure
  • Compare best practices
  • Continue to look for better algorithms.
  • Live in fear:
  • We are at 10,000-node servers
  • We are headed for 1,000,000-node servers

30
Recommendation 2
  • Current security approach is unworkable
  • Anonymous clients
  • Firewall is clueless
  • Incredible complexity
  • We can't win this game!
  • So change the rules (redefine the problem)
  • No anonymity
  • Unified authentication/authorization model
  • Single-function devices (with simple interfaces)
  • Only one kind of interface (uddi/wsdl/soap/...).

31
References
  • Adams, E. (1984). "Optimizing Preventative
    Service of Software Products." IBM Journal of
    Research and Development 28(1): 2-14.
  • Anderson, T. and B. Randell (1979). Computing
    Systems Reliability.
  • Garcia-Molina, H. and C. A. Polyzois (1990).
    "Issues in Disaster Recovery." 35th IEEE Compcon
    90: 573-577.
  • Gray, J. (1986). "Why Do Computers Stop and What
    Can We Do About It?" 5th Symposium on Reliability
    in Distributed Software and Database Systems:
    3-12.
  • Gray, J. (1990). "A Census of Tandem System
    Availability Between 1985 and 1990." IEEE
    Transactions on Reliability 39(4): 409-418.
  • Gray, J. N. and A. Reuter (1993). Transaction
    Processing: Concepts and Techniques. San Mateo,
    CA: Morgan Kaufmann.
  • Lampson, B. W. (1981). "Atomic Transactions."
    Distributed Systems -- Architecture and
    Implementation: An Advanced Course. ACM,
    Springer-Verlag.
  • Laprie, J. C. (1985). "Dependable Computing and
    Fault Tolerance: Concepts and Terminology." 15th
    FTCS: 2-11.
  • Long, D. D., J. L. Carroll, and C. J. Park (1991).
    "A Study of the Reliability of Internet Sites."
    Proc. 10th Symposium on Reliable Distributed
    Systems, Pisa, September 1991: 177-186.
  • Long, D., A. Muir, and R. Golding (1995). "A
    Longitudinal Study of Internet Host Reliability."
    Proceedings of the Symposium on Reliable
    Distributed Systems, Bad Neuenahr, Germany,
    IEEE, September 1995: 2-9.
  • http://www.netcraft.com/ -- They have even better
    for-fee data as well, but the for-free data is
    really excellent.
  • http://www2.ebay.com/aw/announce.shtml#top --
    eBay is an excellent benchmark of best Internet
    practices.
  • http://www-iepm.slac.stanford.edu/pinger/ --
    Network traffic/quality report; dated, but the
    others have died off!