Dependability in the Internet Era - PowerPoint PPT Presentation

1
Dependability in the Internet Era
  • Jim Gray
  • Microsoft Research
  • High Dependability Computing Consortium
    Conference
  • Santa Cruz, CA 7 May 2001
  • REVISED 13 Feb 2005 Stanford, CA

2
Outline
  • The glorious past (Availability Progress)
  • The dark ages (current scene)
  • Some recommendations

3
Preview: The Last 10 Years, Availability Dark Ages
Ready for a Renaissance?
  • Things got better, then things got a lot worse!

[Chart: availability (%), on a scale from 9 to 99.999, plotted against year, 1950 to 2010]
4
DEPENDABILITY: The 3 ITIES
  • RELIABILITY / INTEGRITY: Does the right
    thing. (also MTTF >> 1)
  • AVAILABILITY: Does it now. (also 1 >>
    MTTR)
    Availability = MTTF / (MTTF + MTTR)
    System Availability: if 90% of terminals are up
    and 99% of the DB is up? (=> 89% of
    transactions are serviced on time).
  • Holistic vs. Reductionist view

[Diagram labels: Security, Integrity, Reliability, Availability]
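
The availability relation and the terminals-plus-DB example on this slide can be checked numerically. The following is a minimal sketch, not part of the original talk: the function names and the sample MTTF/MTTR values are illustrative assumptions, and the series calculation assumes the parts fail independently.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series_availability(*parts: float) -> float:
    """Availability of a chain where every part must be up (independence assumed)."""
    result = 1.0
    for part in parts:
        result *= part
    return result


# Slide example: 90% of terminals up and 99% of the DB up.
end_to_end = series_availability(0.90, 0.99)
print(f"end-to-end availability: {end_to_end:.1%}")

# Illustrative values (not from the slides): fails every 1,000 hours, 1 hour to repair.
module = availability(1000.0, 1.0)
print(f"module availability: {module:.4%}")
```

The product is why the slide contrasts a holistic view with a reductionist one: each component can look acceptable alone while the end-to-end figure is noticeably worse.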
5
Fail-Fast is Good, Repair is Needed
Lifecycle of a module: fail-fast gives short
fault latency. High Availability is
low UN-Availability: Unavailability ≈ MTTR /
MTTF
  • Improving either MTTR or MTTF gives benefit
  • Simple redundancy does not help much.

6
Fault Model
  • Failures are independent. So, single fault
    tolerance is a big win
  • Hardware fails fast (dead disk, blue-screen)
  • Software fails-fast (or goes to sleep)
  • Software often repaired by reboot
  • Heisenbugs
  • Operations tasks: major source of outage
  • Utility operations
  • Software upgrades

7
Disks (RAID): the BIG Success Story
  • Duplex or Parity masks faults
  • Disks @ 1M hours (100 years)
  • But controllers fail, and installations
    have 1,000s of disks.
  • Duplexing or parity, and dual path gives perfect
    disks
  • Wal-Mart never lost a byte (thousands of
    disks, hundreds of failures).
  • Only software/operations mistakes are left.
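
To see why thousands of disks produce steady failures and why duplexing still gives near-perfect storage, here is a rough sketch. The 1M-hour disk MTTF comes from the slide; the fleet size of 1,000 disks, the 24-hour repair time, and the pair formula MTTF² / (2 × MTTR) (a standard textbook approximation for a mirrored pair with independent failures) are assumptions added for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def expected_failures_per_year(n_disks: int, disk_mttf_hours: float) -> float:
    """Expected number of disk failures per year across a fleet."""
    return n_disks * HOURS_PER_YEAR / disk_mttf_hours


def mirrored_pair_mttf_hours(disk_mttf_hours: float, repair_hours: float) -> float:
    """Approximate MTTF of a duplexed pair: data is lost only if the
    second disk fails while the first is still being repaired."""
    return disk_mttf_hours ** 2 / (2 * repair_hours)


DISK_MTTF_HOURS = 1_000_000  # from the slide

fleet_failures = expected_failures_per_year(1_000, DISK_MTTF_HOURS)
pair_mttf_years = mirrored_pair_mttf_hours(DISK_MTTF_HOURS, repair_hours=24) / HOURS_PER_YEAR

print(f"failures per year in a 1,000-disk fleet: {fleet_failures:.2f}")
print(f"mirrored pair MTTF: about {pair_mttf_years:,.0f} years")
```

Under these assumptions a large site sees several disk failures a year, yet a mirrored pair's time to data loss is measured in millions of years, which is consistent with the slide's point that only software and operations mistakes are left.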

8
Fault Tolerance vs Disaster Tolerance
  • Fault Tolerance: masks local faults
  • RAID disks
  • Uninterruptible Power Supplies
  • Cluster Failover
  • Disaster Tolerance: masks site failures
  • Protects against fire, flood, sabotage, ...
  • Also, software changes, site moves, ...
  • Redundant system and service at remote site.

9
Availability
[Figure: availability scale marked in 9s, rising with each level below]
  • Un-managed
  • Well-managed nodes: masks some hardware failures
  • Well-managed packs & clones: masks hardware
    failures and operations tasks (e.g. software
    upgrades); masks some software failures
  • Well-managed GeoPlex: masks site failures (power,
    network, fire, move, ...); masks some operations
    failures
10
Case Study - Japan: "Survey on Computer Security",
Japan Info Dev Corp., March 1986. (trans. Eiichi
Watanabe).
[Pie chart, outages by cause: Vendor 42%, Application Software 25%, Tele Comm lines 12%, Environment 11.2%, Operations 9.3%]
MTTF by cause:
  • Vendor (hardware and software): 5 Months
  • Application software: 9 Months
  • Communications lines: 1.5 Years
  • Operations: 2 Years
  • Environment: 2 Years
  • Overall: 10 Weeks
  • 1,383 institutions reported (6/84 - 7/85)
  • 7,517 outages, MTTF 10 weeks, avg
    duration 90 MINUTES
  • To Get 10 Year MTTF, Must Attack All These Areas
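
The overall 10-week figure can be roughly reproduced from the per-cause numbers. This sketch assumes independent causes, so failure rates (1 / MTTF) add, and it assumes the reporting period was about 13 months; both are assumptions made here, not statements from the survey.

```python
WEEKS_PER_MONTH = 52 / 12

# Per-cause MTTF in months, as listed on the slide.
mttf_months = {
    "vendor": 5,
    "application software": 9,
    "communications lines": 18,
    "operations": 24,
    "environment": 24,
}

# Independent causes: the system failure rate is the sum of the per-cause rates.
total_rate = sum(1 / months for months in mttf_months.values())
combined_mttf_weeks = (1 / total_rate) * WEEKS_PER_MONTH

# Cross-check from the raw counts (reporting period assumed to be 13 months).
institutions, outages, period_months = 1_383, 7_517, 13
observed_mttf_weeks = institutions * period_months / outages * WEEKS_PER_MONTH

print(f"combined MTTF from per-cause rates: {combined_mttf_weeks:.1f} weeks")
print(f"MTTF from outage counts: {observed_mttf_weeks:.1f} weeks")
for cause, months in mttf_months.items():
    print(f"  {cause}: {(1 / months) / total_rate:.0%} of the failure rate")
```

Because the rates add, removing any single cause leaves the system MTTF far short of 10 years, which is the slide's argument for attacking all areas.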

11
Case Studies - Tandem Trends
  • MTTF improved
  • Shift from Hardware & Maintenance (from 50% to
    10%) to Software (62%) & Operations (15%)
  • NOTE: Systematic under-reporting of Environment,
    Operations errors, and Application Software

12
Dependability Status circa 1995
  • 4-year MTTF
  • 5 9s for well-managed sys. Fault Tolerance Works.
  • Hardware is GREAT (maintenance and MTTF).
  • Software masks most hardware faults.
  • Many hidden software outages in operations
  • New Software.
  • Utilities.
  • Need to make all hardware/software changes
    ONLINE.
  • Software seems to define a 30-year MTTF ceiling.
  • Reasonable Goal: 100-year MTTF.
    Class 4 today => class 6 tomorrow.
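
The "class" in this goal can be read as the number of leading nines in the availability figure, so class 4 is 99.99% and class 6 is 99.9999%. Below is a minimal sketch of that reading; the function name and the small tolerance term are choices made here for illustration.

```python
import math


def availability_class(availability: float) -> int:
    """Number of leading nines: floor(log10(1 / (1 - availability)))."""
    unavailability = 1.0 - availability
    if not 0.0 < unavailability < 1.0:
        raise ValueError("availability must be strictly between 0 and 1")
    # The tiny epsilon guards against floating-point error for inputs
    # such as 0.99999, where 1 - availability is not exactly 1e-5.
    return math.floor(-math.log10(unavailability) + 1e-9)


for a in (0.99, 0.9999, 0.99999, 0.999999):
    print(f"{a} -> class {availability_class(a)}")
```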

13
Honorable Mention
  • The nice folks at Tandem (now HP)
  • Made failover fast (30 seconds or less).
  • Made change online:
  • Add hardware/software
  • Reorganize database.
  • Rolling upgrades.
  • Added at least one 9 to their story.

14
And Then?
  • Hardware got better (& more complex)
  • Software got better (& more complex)
  • RAID is standard, Snapshots becoming standard
  • Cluster in a box: commodity failover
  • Remote replication is standard.

15
Outline
  • The glorious past (Availability Progress)
  • The dark ages (current scene)
  • Some recommendations

16
Progress?
  • MTTF improved from 1950-1995
  • MTTR: incremental improvements since 1970
    (failover)
  • Hardware and Software online change (PnP) is now
    standard
  • Then the Internet arrived
  • No project can take more than 3 months.
  • Time to market is everything
  • Change is good.

17
The Internet Changed Expectations
  • 1990:
  • Phones delivered 99.999%
  • ATMs delivered 99.99%
  • Failures were front-page news.
  • Few hackers
  • Outages last an hour
  • 2005:
  • Cell phones deliver 90%
  • Web sites deliver 99%
  • Failures are business-page news
  • Many hackers.
  • Outages last a day

This is progress?
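
To make the gap between the 1990 and 2005 figures concrete, the availability percentages can be converted into downtime per year. This is a simple illustrative sketch; the labels reuse the slide's categories and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year implied by an availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


services = [
    ("1990 phones", 99.999),
    ("1990 ATMs", 99.99),
    ("2005 web sites", 99.0),
    ("2005 cell phones", 90.0),
]

for label, percent in services:
    minutes = downtime_minutes_per_year(percent)
    print(f"{label}: {percent}% -> {minutes:,.1f} min/year ({minutes / 1440:.2f} days)")
```

Five nines works out to about five minutes a year, while 99% is more than three and a half days.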
18
Eric Brewer said it best: ACID vs BASE, the
internet litmus test (copy of slide 8 of
http://www.ccs.neu.edu/groups/IEEE/ind-acad/brewer/sld008.htm)
BASE
  • Basic Availability, Soft State, Eventual Consistency
  • Availability FIRST
  • Weak consistency: stale data is OK, approximate
    answers OK
  • Best effort
  • Aggressive (optimistic)
  • Easier evolution
  • Simpler!
  • Faster
ACID
  • Atomicity, Consistency, Isolation, Durability
  • Availability?
  • Strong consistency, Isolation
  • Focus on commit
  • Conservative (pessimistic)
  • Difficult evolution (e.g. schema)
  • Nested transactions

I think it is a spectrum
19
Why (1) Complexity
  • Internet sites are MUCH more complex.
  • NAP
  • Firewall/proxy/IPsprayer
  • Web
  • DMZ
  • App server
  • DB server
  • Links to other sites
  • tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os
  • Skill level is much reduced

20
A Data Center (500 servers)
21
A Schematic of HotMail
  • 7,000 servers
  • 100 backend stores with 300TB (cooked)
  • many data centers
  • Links to
  • Internet Mail gateways
  • Ad-rotator
  • Passport
  • 5 B messages per day
  • 350M mailboxes, 250M active
  • 1M new per day.
  • New software every 3 months (small changes
    weekly).

[Schematic labels: Internet, gateways, Switched Ethernet, Local Directors, Front Doors, Login Servers, Member Directory, Graphics Servers, AD Servers, Incoming Mail Servers, Data, USTORES, MSERVS, Telnet Management]
22
Why (2) Velocity
  • No project can take more than 13 weeks.
  • Time to market is everything
  • Functionality is everything
  • Faster, cheaper,

23
Why (3) Hackers
  • Hackers are a new & increased threat
  • Any site can be attacked from anywhere
  • Motives include ego, malice, and greed.
  • Complexity makes it hard to protect sites.
  • Whole-internet attacks: Slammer
  • Concentration of wealth makes an attractive target
  • Reporter: Why did you rob banks?
  • Willie Sutton: 'Cause that's where the money is!

Note: Eric Raymond's How to Become a Hacker
(http://www.tuxedo.org/esr/faqs/hacker-howto.html)
is the positive use of Hacker; here I mean
malicious and anti-social hackers. Black-hats,
not white-hats.
24
How Bad Is It?
http://www-iepm.slac.stanford.edu/
Connectivity is poor.
http://www.internettrafficreport.com/main.htm
25
How Bad Is It?
http://www-iepm.slac.stanford.edu/pinger/
  • Median monthly ping packet loss for 2/99

26
And in 2006, about the same
27
Or In the US
28
Keynote measures Response Time and Up Time
Measures response time around the world. Business
service is better than popular service. Has many
proprietary services for SLAs.

                      Week of April 22 - April 28, 2001   Previous Week
Index                 15.90                               15.78
Web sites with best   Ameritrade (65)   3.29              Ameritrade (64)   3.35
performance averages  Lycos (81)        5.41              Lycos (80)        5.58
                      Yahoo! (81)       5.79              Yahoo! (80)       5.74
                      Altavista (19)    6.03              Ask Jeeves (7)    6.11
                      Go.com            7.02              Altavista (18)    6.17
Worst average         (anonymous)      38.04              (anonymous)      37.44
29
2006: typical 97.48% availability
30
Netcraft's Crisis-of-the-Day
32
Service Level Measurements
  • Many organizations are measured on SLAs
  • Example: 1 sec response for 99% of prime
    time
  • Keynote, Netcraft, ... offer to monitor your
    site (probe every few min)
  • This probing can go deep into the tree to detect
    services.
  • Send alerts via email
  • Give monthly reports.
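
An SLA of the kind described here can be evaluated directly from periodic probe results. The sketch below is a hypothetical illustration, not how Keynote or Netcraft actually compute their reports: the `Probe` record, the function names, and the sample data are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One monitoring probe: did the site answer, and how quickly?"""
    ok: bool
    response_seconds: float


def sla_compliance(probes: list[Probe], max_response_seconds: float = 1.0) -> float:
    """Fraction of probes that succeeded within the response-time limit."""
    if not probes:
        raise ValueError("no probes to evaluate")
    good = sum(1 for p in probes if p.ok and p.response_seconds <= max_response_seconds)
    return good / len(probes)


def meets_sla(probes: list[Probe], target: float = 0.99,
              max_response_seconds: float = 1.0) -> bool:
    """True if the compliant fraction reaches the target (e.g. 99%)."""
    return sla_compliance(probes, max_response_seconds) >= target


# Made-up sample: 98 fast responses, one slow response, one failure.
samples = [Probe(True, 0.4)] * 98 + [Probe(True, 2.5), Probe(False, 0.0)]
print(f"compliance: {sla_compliance(samples):.0%}, meets SLA: {meets_sla(samples)}")
```

Note that a slow response counts against the SLA just like an outage, which is why response time and up time are measured together.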

33
In addition
  • Most large sites build their own instrumentation
    (several times ?)
  • This instrumentation is elaborate and essential
    for the Network Operations Center (NOC).
  • There are attempts now to systematize it: Tivoli,
    OpenView, NetIQ, WhatsUP, Mom, ...

34
Microsoft.Com
  • Operations mis-configured a router
  • Took a day to diagnose and repair.
  • DOS attacks cost a fraction of a day.
  • Regular security patches.

35
Back-End Servers are More Stable
  • Generally deliver 99.99%
  • TerraServer, for example: single back-end
    failed after 2.5 years.
  • Went to 4-node cluster
  • Fails every 2 months. Transparent failover in 30
    sec. Online software upgrades. So 99.999% in
    backend.
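
The five-nines claim follows from the failure interval and the failover time. A back-of-the-envelope sketch, assuming 2 months is about 60 days and that the 30-second failover is the only downtime per failure (both assumptions made here):

```python
SECONDS_PER_DAY = 86_400


def availability_with_failover(days_between_failures: float,
                               failover_seconds: float) -> float:
    """Availability when each failure costs only the failover time."""
    uptime_seconds = days_between_failures * SECONDS_PER_DAY
    return uptime_seconds / (uptime_seconds + failover_seconds)


cluster = availability_with_failover(days_between_failures=60, failover_seconds=30)
print(f"cluster availability: {cluster:.5%}")
```

The result is above 99.999%, so frequent failures are tolerable as long as repair (here, failover) is fast.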

36
eBay: A very honest site
http://www2.ebay.com/aw/announce.shtml
  • Publishes operations log.
  • Has 99% of scheduled uptime
  • Schedules about 2 hours/week down.
  • Has had some operations outages
  • Has had some DOS problems.
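
Scheduled downtime still counts from the user's point of view. The sketch below combines the two figures on this slide, reading "99 of scheduled uptime" as 99% of the hours that were not scheduled down; that reading and the function name are assumptions made for illustration.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def overall_availability(scheduled_down_hours_per_week: float,
                         scheduled_uptime_fraction: float) -> float:
    """Fraction of all hours the site is up, counting scheduled downtime as down."""
    scheduled_up_hours = HOURS_PER_WEEK - scheduled_down_hours_per_week
    return scheduled_up_hours * scheduled_uptime_fraction / HOURS_PER_WEEK


site = overall_availability(scheduled_down_hours_per_week=2,
                            scheduled_uptime_fraction=0.99)
print(f"overall availability: {site:.2%}")
```

Under this reading the site is up a little under 98% of all hours, below the 99% that the scheduled-uptime figure alone suggests.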

37
And 2006.
http://www2.ebay.com/aw/announce.shtml
  • Welcome to eBay's System Board. Visit this board
    for information on scheduled site maintenance or
    system issues that are affecting Marketplace
    trading. For general eBay news, please see our
    General Announcements Board.
  • Resolved - PayPal site slowness
  • February 08, 2006 05:20 PM PST/PT: For several
    hours today, members may have experienced
    slowness while trying to access the PayPal
    website. This issue has now been resolved.
    Thank you for your patience.
  • PayPal site slowness
  • February 08, 2006 02:38 PM PST/PT: Members may be
    experiencing intermittent slowness while trying
    to access the PayPal website. We're aware of this
    issue and are working to fix it as quickly as
    possible. Thank you for your patience.
  • Scheduled Maintenance For This Week
  • February 08, 2006 02:03 PM PST/PT: The eBay
    system will be undergoing general maintenance
    from approximately 23:00 PT on Thursday, February
    9th to 01:00 PT on Friday, February 10th. During
    this maintenance period, certain eBay site
    features may be intermittently unavailable or
    slow.

38
Some Cool New Things
  • There are 100,000 node services.
  • Google File System shows importance & benefit
    of Triplex
  • DB replication & mirroring works (is easy)
  • Little things I have done:
  • With Leslie Lamport: unified Paxos & 2PC
  • Measured mean-time-to-data-loss (and continue to
    measure things).

39
Outline
  • The glorious past (Availability Progress)
  • The dark ages (current scene)
  • Some recommendations

40
Not to throw stones, but...
  • Everyone has a serious problem.
  • The BEST people publish their stats.
  • The others HIDE their stats (check Netcraft
    to see who I mean).
  • We have good NODE-level availability: 5-9s is
    reasonable.
  • We have TERRIBLE system-level availability: 2-9s
    scheduled is the goal (!).

41
Gresham's Law: bad money drives out good
  • People WANT features!
  • People WANT convenience!
  • People WANT cheap!
  • In exchange, they seem to be willing to tolerate
    some:
  • Un-availability (inconvenience)
  • Dirty data that needs reconciliation
  • Insecurity
  • I see it as our task to make it easier &
    cheaper to get high availability and security.

42
Recommendation 1
  • Continue progress on back-ends.
  • Make management easier (AUTOMATE IT!!!)
  • Measure
  • Compare best practices
  • Continue to look for better algorithms.
  • Live in fear
  • We are at 10,000 node servers
  • We are headed for 1,000,000 node servers

43
Recommendation 2
  • Current security approach is unworkable
  • Anonymous clients
  • Firewall is clueless
  • Incredible complexity
  • We can't win this game!
  • So change the rules (redefine the problem)
  • No anonymity
  • Unified authentication/authorization model
  • Single-function devices (with simple interfaces)
  • Only one kind of interface (uddi/wsdl/soap/...).

44
Recommendation 3
  • Dependability requires a holistic, not
    reductionist, approach.
  • It's the WHOLE system (end-to-end,
    top-to-bottom)
  • Hard to publish in this area, hard to get tenure.
  • Journals want theorem+proof and crisp statements.
  • Companies want to make money, so they do not
    share their knowledge.
  • Dependability is an important social good,
  • So Dependability Research needs
    government or philanthropic sponsorship.

45
References
  • Adams, E. (1984). Optimizing Preventative
    Service of Software Products. IBM Journal of
    Research and Development. 28(1): 2-14.
  • Anderson, T. and B. Randell. (1979). Computing
    Systems Reliability.
  • Garcia-Molina, H. and C. A. Polyzois. (1990).
    Issues in Disaster Recovery. 35th IEEE Compcon
    90. 573-577.
  • Gray, J. (1986). Why Do Computers Stop and What
    Can We Do About It. 5th Symposium on Reliability
    in Distributed Software and Database Systems.
    3-12.
  • Gray, J. (1990). A Census of Tandem System
    Availability between 1985 and 1990. IEEE
    Transactions on Reliability. 39(4): 409-418.
  • Gray, J. N. and A. Reuter. (1993). Transaction
    Processing: Concepts and Techniques. San Mateo,
    Morgan Kaufmann.
  • Lampson, B. W. (1981). Atomic Transactions.
    Distributed Systems -- Architecture and
    Implementation: An Advanced Course. ACM,
    Springer-Verlag.
  • Laprie, J. C. (1985). Dependable Computing and
    Fault Tolerance: Concepts and Terminology. 15th
    FTCS. 2-11.
  • Long, D. D., J. L. Carroll, and C. J. Park. (1991).
    A Study of the Reliability of Internet Sites.
    Proc. 10th Symposium on Reliable Distributed
    Systems, pp. 177-186, Pisa, September 1991.
  • Theory and Practice of Reliable System Design,
    Dan Siewiorek, Robert Swarz
  • Building Secure and Reliable Network
    Applications, Ken P. Birman
  • Darrell Long, Andrew Muir and Richard Golding,
    "A Longitudinal Study of Internet Host
    Reliability," Proc. of the Symposium on Reliable
    Distributed Systems, Bad Neuenahr, Germany, IEEE,
    1995, pp. 2-9
  • http://www.netcraft.com/ They have even better
    for-fee data as well, but for-free is really
    excellent.
  • http://www2.ebay.com/aw/announce.shtml#top eBay
    is an excellent benchmark of best Internet
    practices
  • Empirical Measurements of Disk Failure Rates and
    Error Rates, C. van Ingen: moving 2P with cheap
    iron
  • Consensus on Transaction Commit, L. Lamport:
    unifies 2PC and Byzantine-Paxos