Title: Dependability in the Internet Era
1. Dependability in the Internet Era
2. Outline
- The glorious past (Availability Progress)
- The dark ages (current scene)
- Some recommendations
3. Preview: The Last 5 Years, an Availability Dark Ages
Ready for a Renaissance?
- Things got better, then things got a lot worse!
[Chart: Availability vs. year, 1950-2000. Telephone systems climb to 99.999% and computer systems toward 99.99%, while the newer cell phone and Internet curves sit far lower.]
4. DEPENDABILITY: The 3 ITIES
- RELIABILITY / INTEGRITY: does the right thing. (Also MTTF >> 1.)
- AVAILABILITY: does it now. (Also 1 >> MTTR.)
  Availability = MTTF / (MTTF + MTTR)
- System availability is the product of component availabilities: if 90% of terminals are up and 99% of the DB is up, only about 89% of transactions are serviced on time (a numeric check follows below).
- Holistic vs. Reductionist view
[Diagram: Security, Integrity, Reliability, and Availability as interlocking facets of dependability.]
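A minimal numeric sketch of the formulas above (the module numbers are illustrative assumptions, not figures from the talk):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical module: fails about once a year, repaired in 4 hours.
a = availability(mttf_hours=8760.0, mttr_hours=4.0)
print(f"module availability: {a:.5f}")        # ~0.99954, about 3.5 nines

# Reductionist view: a transaction needs the terminal AND the DB,
# so availabilities multiply.
print(f"terminals x DB: {0.90 * 0.99:.3f}")   # 0.891 -> ~89% serviced on time
```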
5. Fail-Fast is Good, Repair is Needed
- Lifecycle of a module: fail-fast gives short fault latency.
- High availability is low UN-availability:
  Unavailability = MTTR / (MTTF + MTTR) ~ MTTR / MTTF (when MTTF >> MTTR)
- Improving either MTTR or MTTF gives benefit (compared below).
- Simple redundancy does not help much.
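A small sketch of that symmetry, with illustrative numbers: halving MTTR buys the same unavailability reduction as doubling MTTF.

```python
def unavailability(mttf: float, mttr: float) -> float:
    """Exact form: MTTR / (MTTF + MTTR); ~ MTTR/MTTF when MTTF >> MTTR."""
    return mttr / (mttf + mttr)

base = unavailability(mttf=1000.0, mttr=1.0)   # ~1e-3
print(unavailability(2000.0, 1.0) / base)       # ~0.5: doubled MTTF
print(unavailability(1000.0, 0.5) / base)       # ~0.5: halved MTTR
```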
6. Fault Model
- Failures are independent, so single-fault tolerance is a big win.
- Hardware fails fast (dead disk, blue-screen).
- Software fails fast (or goes to sleep).
- Software is often repaired by reboot.
- Heisenbugs: transient faults that vanish on retry.
- Operations tasks are a major source of outage:
  - Utility operations
  - Software upgrades
7. Disks (RAID): the BIG Success Story
- Duplex or parity masks faults.
- Disks are at ~1M-hour MTTF (about 100 years).
- But:
  - controllers fail, and
  - large sites have 1,000s of disks.
- Duplexing or parity, plus dual paths, gives essentially perfect disks (rough math below).
- Wal-Mart never lost a byte (thousands of disks, hundreds of failures).
- Only software/operations mistakes are left.
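A back-of-envelope for why duplexing works, using the standard mirrored-pair approximation MTTF_pair ~ MTTF^2 / (2 x MTTR); the 24-hour rebuild window is an illustrative assumption, not from the talk:

```python
# Assumes independent disk failures and a ~24 h mirror rebuild.
disk_mttf_h = 1_000_000                    # ~100 years, the slide's figure
rebuild_h = 24.0                           # hypothetical repair window

# The pair is lost only if the second disk dies during the rebuild.
pair_mttf_h = disk_mttf_h ** 2 / (2 * rebuild_h)
print(f"pair MTTF: {pair_mttf_h / 8760:,.0f} years")   # ~2.4 million years
```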
8. Fault Tolerance vs. Disaster Tolerance
- Fault tolerance: masks local faults.
  - RAID disks
  - Uninterruptible power supplies
  - Cluster failover
- Disaster tolerance: masks site failures.
  - Protects against fire, flood, sabotage, ...
  - Redundant system and service at a remote site.
9. Case Study: Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (Trans. Eiichi Watanabe.)
[Pie chart: share of outages by cause: Vendor, Tele Comm Lines, Environment, Application Software, Operations; Vendor is the largest slice.]
Mean time between outages, by cause:
- Vendor (hardware and software): 5 months
- Application software: 9 months
- Communications lines: 1.5 years
- Operations: 2 years
- Environment: 2 years
- Overall: 10 weeks
- 1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF 10 weeks, average outage duration 90 MINUTES.
- To get a 10-year MTTF, you must attack ALL of these areas (the rates add, as the sketch below checks).
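The 10-week overall figure is just the per-cause rates summed; a quick check, assuming the causes fail independently:

```python
# MTTFs combine harmonically: total outage rate = sum of per-cause rates.
causes_mttf_years = {
    "vendor": 5 / 12,
    "application software": 9 / 12,
    "comm lines": 1.5,
    "operations": 2.0,
    "environment": 2.0,
}
total_rate_per_year = sum(1 / mttf for mttf in causes_mttf_years.values())
print(f"overall MTTF: {52 / total_rate_per_year:.1f} weeks")   # ~9.6, i.e. ~10
```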
10. Case Studies: Tandem Trends
- MTTF improved.
- Outage causes shifted: Hardware and Maintenance fell from 50% to 10%; Software (62%) and Operations (15%) now dominate.
- NOTE: systematic under-reporting of:
  - Environment
  - Operations errors
  - Application software
11. Dependability Status circa 1995
- 4-year MTTF => 5 9s for well-managed systems. Fault tolerance works.
- Hardware is GREAT (maintenance and MTTF).
- Software masks most hardware faults.
- Many hidden software outages in operations:
  - New software.
  - Utilities.
- Make all hardware/software changes ONLINE.
- Software seems to define a 30-year MTTF ceiling.
- Reasonable goal: 100-year MTTF, i.e. class 4 today => class 6 tomorrow.
12. What's Happened Since Then?
- Hardware got better.
- Software got better (even though it is more complex).
- RAID is standard; snapshots are becoming standard.
- Cluster-in-a-box commodity failover.
- Remote replication is standard.
13. Availability
[Chart: availability ladder, roughly one added 9 per layer of management.]
- Un-managed: baseline availability.
- Well-managed nodes: masks some hardware failures.
- Well-managed packs and clones: masks hardware failures and operations tasks (e.g. software upgrades); masks some software failures (see the sketch below).
- Well-managed GeoPlex: masks site failures (power, network, fire, move, ...); masks some operations failures.
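A sketch of why clones add nines: if requests can be routed to any of n independent clones, their unavailabilities multiply (the node figure is an illustrative assumption):

```python
node_avail = 0.999                         # a well-managed node: 3 nines
for n in (1, 2, 3):
    pack_unavail = (1 - node_avail) ** n   # all n clones down at once
    print(f"{n} clone(s): {1 - pack_unavail:.9f}")
# 1 -> 0.999000000, 2 -> 0.999999000, 3 -> 0.999999999
```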
14. Outline
- The glorious past (Availability Progress)
- The dark ages (current scene)
- Some recommendations
15. Progress?
- MTTF improved from 1950 to 1995.
- MTTR has not improved much since 1970 (failover).
- Hardware and software online change (PnP) is now standard.
- Then the Internet arrived:
  - No project can take more than 3 months.
  - Time to market is everything.
  - Change is good.
16. The Internet Changed Expectations
- 1990:
  - Phones delivered 99.999%.
  - ATMs delivered 99.99%.
  - Failures were front-page news.
  - Few hackers.
  - Outages lasted an hour.
- 2000:
  - Cellphones deliver 90%.
  - Web sites deliver 98%.
  - Failures are business-page news.
  - Many hackers.
  - Outages last a day.
This is progress? (The downtime behind those percentages is worked out below.)
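For scale, here is what those availability figures mean in downtime per year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60

for avail in (0.99999, 0.9999, 0.98, 0.90):
    down_hours = (1 - avail) * MINUTES_PER_YEAR / 60
    print(f"{avail:.3%} available -> {down_hours:7.1f} hours down per year")
# 99.999% -> ~0.1 h; 99.99% -> ~0.9 h; 98% -> ~175 h; 90% -> ~876 h
```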
17. Why? (1) Complexity
- Internet sites are MUCH more complex:
  - NAP
  - Firewall / proxy / IP sprayer
  - Web server
  - DMZ
  - App server
  - DB server
  - Links to other sites
  - tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os
- Skill level is much reduced.
18. One of the Data Centers (500 servers)
19. A Schematic of HotMail
- 7,000 servers.
- 100 back-end stores with 120 TB (cooked).
- 3 data centers.
- Links to:
  - Passport
  - Ad-rotator
  - Internet mail gateways
  - ...
- 1B messages per day.
- 150M mailboxes, 100M active.
- 400,000 new mailboxes per day.
20. Why? (2) Velocity
- No project can take more than 13 weeks.
- Time to market is everything.
- Functionality is everything.
- Faster, cheaper, badder?
21. Why? (3) Hackers
- Hackers are a new, increased threat:
  - Any site can be attacked from anywhere.
  - Motives include ego, malice, and greed.
- Complexity makes it hard to protect sites.
- Concentration of wealth makes them attractive targets.
  - "Why did you rob banks?"
  - Willie Sutton: "'Cause that's where the money is!"
Note: Eric Raymond's "How to Become a Hacker" (http://www.tuxedo.org/~esr/faqs/hacker-howto.html) is the positive use of the term; here I mean malicious and anti-social hackers.
22. How Bad Is It?
http://www-iepm.slac.stanford.edu/
- Connectivity is poor.
23. How Bad Is It?
http://www-iepm.slac.stanford.edu/pinger/
- Median monthly ping packet loss for 2/99.
24. Microsoft.Com
- Operations mis-configured a router.
- It took a day to diagnose and repair.
- DoS attacks cost a fraction of a day.
- Regular security patches.
25. Back-End Servers Are More Stable
- Generally deliver 99.99%.
- TerraServer, for example: the single back-end failed after 2.5 years.
- Went to a 4-node cluster:
  - Fails every 2 months.
  - Transparent failover in 30 seconds.
  - Online software upgrades.
  - So: 99.999% in the back end (checked below).
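Those figures are self-consistent: a failure every two months, masked by a 30-second failover, is about five nines, assuming failover is the only downtime:

```python
mttf_s = 2 * 30 * 24 * 3600   # ~2 months between failures, in seconds
mttr_s = 30                   # transparent failover time
print(f"{mttf_s / (mttf_s + mttr_s):.7f}")   # ~0.9999942: about 5 nines
```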
26. eBay: A Very Honest Site
http://www2.ebay.com/aw/announce.shtml#top
- Publishes its operations log.
- Has 99% of scheduled uptime.
- Schedules about 2 hours/week of downtime.
- Has had some operations outages.
- Has had some DoS problems.
27. Outline
- The glorious past (Availability Progress)
- The dark ages (current scene)
- Some recommendations
28. Not to Throw Stones, but...
- Everyone has a serious problem.
- The BEST people publish their stats; the others HIDE their stats (check Netcraft to see who I mean).
- We have good NODE-level availability: 5 9s is reasonable.
- We have TERRIBLE system-level availability: 2 9s is the goal.
29. Recommendation 1
- Continue progress on back ends:
  - Make management easier (AUTOMATE IT!!!).
  - Measure.
  - Compare best practices.
  - Continue to look for better algorithms.
- Live in fear:
  - We are at 10,000-node servers.
  - We are headed for 1,000,000-node servers.
30. Recommendation 2
- The current security approach is unworkable:
  - Anonymous clients.
  - The firewall is clueless.
  - Incredible complexity.
  - We can't win this game!
- So change the rules (redefine the problem):
  - No anonymity.
  - A unified authentication/authorization model.
  - Single-function devices (with simple interfaces).
  - Only one kind of interface (UDDI/WSDL/SOAP/...); one way to read that is sketched below.
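A hypothetical sketch, not from the talk, of what "single-function devices behind one kind of interface" could look like: every service exposes the same minimal, authenticated entry point, so there is exactly one surface to secure and audit. All names below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Request:
    principal: str    # authenticated identity: no anonymous clients
    payload: bytes

class SingleFunctionDevice:
    """One device, one function, one uniform interface."""
    def invoke(self, req: Request) -> bytes:
        raise NotImplementedError

class MailStore(SingleFunctionDevice):
    def invoke(self, req: Request) -> bytes:
        # The device's single function: store the payload for the principal.
        return f"stored {len(req.payload)} bytes for {req.principal}".encode()

device: SingleFunctionDevice = MailStore()
print(device.invoke(Request(principal="alice", payload=b"hello")))
```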
31. References
- Adams, E. (1984). "Optimizing Preventive Service of Software Products." IBM Journal of Research and Development 28(1): 2-14.
- Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
- Garcia-Molina, H. and C. A. Polyzois. (1990). "Issues in Disaster Recovery." 35th IEEE Compcon 90: 573-577.
- Gray, J. (1986). "Why Do Computers Stop and What Can We Do About It." 5th Symposium on Reliability in Distributed Software and Database Systems: 3-12.
- Gray, J. (1990). "A Census of Tandem System Availability Between 1985 and 1990." IEEE Transactions on Reliability 39(4): 409-418.
- Gray, J. N. and A. Reuter. (1993). Transaction Processing: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann.
- Lampson, B. W. (1981). "Atomic Transactions." In Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
- Laprie, J. C. (1985). "Dependable Computing and Fault Tolerance: Concepts and Terminology." 15th FTCS: 2-11.
- Long, D. D., J. L. Carroll, and C. J. Park. (1991). "A Study of the Reliability of Internet Sites." Proc. 10th Symposium on Reliable Distributed Systems, Pisa, September 1991: 177-186.
- Long, D., A. Muir, and R. Golding. (1995). "A Longitudinal Study of Internet Host Reliability." Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, IEEE, September 1995: 2-9.
- http://www.netcraft.com/ -- They have even better for-fee data as well, but the for-free data is really excellent.
- http://www2.ebay.com/aw/announce.shtml#top -- eBay is an excellent benchmark of best Internet practices.
- http://www-iepm.slac.stanford.edu/pinger/ -- Network traffic/quality report; dated, but the others have died off!