Title: Dependability in the Internet Era
1. Dependability in the Internet Era
2. Outline
- The glorious past (Availability Progress)
- The dark ages (current scene)
- Some recommendations
3. Preview: The Last 5 Years, an Availability Dark Ages
Ready for a Renaissance?
- Things got better, then things got a lot worse!
[Chart: Availability vs. year, 1950-2000. Telephone systems climb to 99.999% and computer systems toward 99.99%, while the newer cell phone and Internet curves sit far lower.]
4. DEPENDABILITY: The 3 ITIES
- RELIABILITY / INTEGRITY: does the right thing. (Also MTTF >> 1.)
- AVAILABILITY: does it now. (Also 1 >> MTTR.)
  Availability = MTTF / (MTTF + MTTR)
- System availability is the product of component availabilities: if 90% of terminals are up and 99% of the DB is up, only about 89% of transactions are serviced on time (a numeric check follows below).
- Holistic vs. Reductionist view
[Diagram: Security, Integrity, Reliability, and Availability as interlocking facets of dependability.]
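A minimal numeric sketch of the formulas above (the module numbers are illustrative assumptions, not figures from the talk):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical module: fails about once a year, repaired in 4 hours.
a = availability(mttf_hours=8760.0, mttr_hours=4.0)
print(f"module availability: {a:.5f}")        # ~0.99954, about 3.5 nines

# Reductionist view: a transaction needs the terminal AND the DB,
# so availabilities multiply.
print(f"terminals x DB: {0.90 * 0.99:.3f}")   # 0.891 -> ~89% serviced on time
```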
5. Fail-Fast is Good, Repair is Needed
- Lifecycle of a module: fail-fast gives short fault latency.
- High availability is low UN-availability:
  Unavailability = MTTR / (MTTF + MTTR) ~ MTTR / MTTF (when MTTF >> MTTR)
- Improving either MTTR or MTTF gives benefit (compared below).
- Simple redundancy does not help much.
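A small sketch of that symmetry, with illustrative numbers: halving MTTR buys the same unavailability reduction as doubling MTTF.

```python
def unavailability(mttf: float, mttr: float) -> float:
    """Exact form: MTTR / (MTTF + MTTR); ~ MTTR/MTTF when MTTF >> MTTR."""
    return mttr / (mttf + mttr)

base = unavailability(mttf=1000.0, mttr=1.0)   # ~1e-3
print(unavailability(2000.0, 1.0) / base)       # ~0.5: doubled MTTF
print(unavailability(1000.0, 0.5) / base)       # ~0.5: halved MTTR
```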
6. Fault Model
- Failures are independent, so single-fault tolerance is a big win.
- Hardware fails fast (dead disk, blue-screen).
- Software fails fast (or goes to sleep).
- Software is often repaired by reboot.
- Heisenbugs: transient faults that vanish on retry.
- Operations tasks are a major source of outage:
  - Utility operations
  - Software upgrades
7. Disks (RAID): the BIG Success Story
- Duplex or parity masks faults.
- Disks are at ~1M-hour MTTF (about 100 years).
- But:
  - controllers fail, and
  - large sites have 1,000s of disks.
- Duplexing or parity, plus dual paths, gives essentially perfect disks (rough math below).
- Wal-Mart never lost a byte (thousands of disks, hundreds of failures).
- Only software/operations mistakes are left.
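A back-of-envelope for why duplexing works, using the standard mirrored-pair approximation MTTF_pair ~ MTTF^2 / (2 x MTTR); the 24-hour rebuild window is an illustrative assumption, not from the talk:

```python
# Assumes independent disk failures and a ~24 h mirror rebuild.
disk_mttf_h = 1_000_000                    # ~100 years, the slide's figure
rebuild_h = 24.0                           # hypothetical repair window

# The pair is lost only if the second disk dies during the rebuild.
pair_mttf_h = disk_mttf_h ** 2 / (2 * rebuild_h)
print(f"pair MTTF: {pair_mttf_h / 8760:,.0f} years")   # ~2.4 million years
```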
8. Fault Tolerance vs. Disaster Tolerance
- Fault tolerance: masks local faults.
  - RAID disks
  - Uninterruptible power supplies
  - Cluster failover
- Disaster tolerance: masks site failures.
  - Protects against fire, flood, sabotage, ...
  - Redundant system and service at a remote site.
9. Case Study: Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (Trans. Eiichi Watanabe.)
[Pie chart: share of outages by cause: Vendor, Tele Comm Lines, Environment, Application Software, Operations; Vendor is the largest slice.]
Mean time between outages, by cause:
- Vendor (hardware and software): 5 months
- Application software: 9 months
- Communications lines: 1.5 years
- Operations: 2 years
- Environment: 2 years
- Overall: 10 weeks
- 1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF 10 weeks, average outage duration 90 MINUTES.
- To get a 10-year MTTF, you must attack ALL of these areas (the rates add, as the sketch below checks).
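The 10-week overall figure is just the per-cause rates summed; a quick check, assuming the causes fail independently:

```python
# MTTFs combine harmonically: total outage rate = sum of per-cause rates.
causes_mttf_years = {
    "vendor": 5 / 12,
    "application software": 9 / 12,
    "comm lines": 1.5,
    "operations": 2.0,
    "environment": 2.0,
}
total_rate_per_year = sum(1 / mttf for mttf in causes_mttf_years.values())
print(f"overall MTTF: {52 / total_rate_per_year:.1f} weeks")   # ~9.6, i.e. ~10
```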
10. Case Studies: Tandem Trends
- MTTF improved.
- Outage causes shifted: Hardware and Maintenance fell from 50% to 10%; Software (62%) and Operations (15%) now dominate.
- NOTE: systematic under-reporting of:
  - Environment
  - Operations errors
  - Application software
11. Dependability Status circa 1995
- 4-year MTTF => 5 9s for well-managed systems. Fault tolerance works.
- Hardware is GREAT (maintenance and MTTF).
- Software masks most hardware faults.
- Many hidden software outages in operations:
  - New software.
  - Utilities.
- Make all hardware/software changes ONLINE.
- Software seems to define a 30-year MTTF ceiling.
- Reasonable goal: 100-year MTTF, i.e. class 4 today => class 6 tomorrow.
12. What's Happened Since Then?
- Hardware got better.
- Software got better (even though it is more complex).
- RAID is standard; snapshots are becoming standard.
- Cluster-in-a-box commodity failover.
- Remote replication is standard.
13. Availability
[Chart: availability ladder, roughly one added 9 per layer of management.]
- Un-managed: baseline availability.
- Well-managed nodes: masks some hardware failures.
- Well-managed packs and clones: masks hardware failures and operations tasks (e.g. software upgrades); masks some software failures (see the sketch below).
- Well-managed GeoPlex: masks site failures (power, network, fire, move, ...); masks some operations failures.
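A sketch of why clones add nines: if requests can be routed to any of n independent clones, their unavailabilities multiply (the node figure is an illustrative assumption):

```python
node_avail = 0.999                         # a well-managed node: 3 nines
for n in (1, 2, 3):
    pack_unavail = (1 - node_avail) ** n   # all n clones down at once
    print(f"{n} clone(s): {1 - pack_unavail:.9f}")
# 1 -> 0.999000000, 2 -> 0.999999000, 3 -> 0.999999999
```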
14. Outline
- The glorious past (Availability Progress)
- The dark ages (current scene)
- Some recommendations
15. Progress?
- MTTF improved from 1950 to 1995.
- MTTR has not improved much since 1970 (failover).
- Hardware and software online change (PnP) is now standard.
- Then the Internet arrived:
  - No project can take more than 3 months.
  - Time to market is everything.
  - Change is good.
16. The Internet Changed Expectations
- 1990:
  - Phones delivered 99.999%.
  - ATMs delivered 99.99%.
  - Failures were front-page news.
  - Few hackers.
  - Outages lasted an hour.
- 2000:
  - Cellphones deliver 90%.
  - Web sites deliver 98%.
  - Failures are business-page news.
  - Many hackers.
  - Outages last a day.
This is progress? (The downtime behind those percentages is worked out below.)
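For scale, here is what those availability figures mean in downtime per year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60

for avail in (0.99999, 0.9999, 0.98, 0.90):
    down_hours = (1 - avail) * MINUTES_PER_YEAR / 60
    print(f"{avail:.3%} available -> {down_hours:7.1f} hours down per year")
# 99.999% -> ~0.1 h; 99.99% -> ~0.9 h; 98% -> ~175 h; 90% -> ~876 h
```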
17. Why? (1) Complexity
- Internet sites are MUCH more complex:
  - NAP
  - Firewall / proxy / IP sprayer
  - Web server
  - DMZ
  - App server
  - DB server
  - Links to other sites
  - tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os
- Skill level is much reduced.
18. One of the Data Centers (500 servers)
19. A Schematic of HotMail
- 7,000 servers.
- 100 back-end stores with 120 TB (cooked).
- 3 data centers.
- Links to:
  - Passport
  - Ad-rotator
  - Internet mail gateways
  - ...
- 1B messages per day.
- 150M mailboxes, 100M active.
- 400,000 new mailboxes per day.
20. Why? (2) Velocity
- No project can take more than 13 weeks.
- Time to market is everything.
- Functionality is everything.
- Faster, cheaper, badder?
21. Why? (3) Hackers
- Hackers are a new, increased threat:
  - Any site can be attacked from anywhere.
  - Motives include ego, malice, and greed.
- Complexity makes it hard to protect sites.
- Concentration of wealth makes them attractive targets.
  - "Why did you rob banks?"
  - Willie Sutton: "'Cause that's where the money is!"
Note: Eric Raymond's "How to Become a Hacker" (http://www.tuxedo.org/~esr/faqs/hacker-howto.html) is the positive use of the term; here I mean malicious and anti-social hackers.
22. How Bad Is It?
http://www-iepm.slac.stanford.edu/
- Connectivity is poor.
23. How Bad Is It?
http://www-iepm.slac.stanford.edu/pinger/
- Median monthly ping packet loss for 2/99.
24. Microsoft.Com
- Operations mis-configured a router.
- It took a day to diagnose and repair.
- DoS attacks cost a fraction of a day.
- Regular security patches.
25. Back-End Servers Are More Stable
- Generally deliver 99.99%.
- TerraServer, for example: the single back-end failed after 2.5 years.
- Went to a 4-node cluster:
  - Fails every 2 months.
  - Transparent failover in 30 seconds.
  - Online software upgrades.
  - So: 99.999% in the back end (checked below).
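Those figures are self-consistent: a failure every two months, masked by a 30-second failover, is about five nines, assuming failover is the only downtime:

```python
mttf_s = 2 * 30 * 24 * 3600   # ~2 months between failures, in seconds
mttr_s = 30                   # transparent failover time
print(f"{mttf_s / (mttf_s + mttr_s):.7f}")   # ~0.9999942: about 5 nines
```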
26. eBay: A Very Honest Site
http://www2.ebay.com/aw/announce.shtml#top
- Publishes its operations log.
- Has 99% of scheduled uptime.
- Schedules about 2 hours/week of downtime.
- Has had some operations outages.
- Has had some DoS problems.
27. Outline
- The glorious past (Availability Progress)
- The dark ages (current scene)
- Some recommendations
28. Not to Throw Stones, but...
- Everyone has a serious problem.
- The BEST people publish their stats; the others HIDE their stats (check Netcraft to see who I mean).
- We have good NODE-level availability: 5 9s is reasonable.
- We have TERRIBLE system-level availability: 2 9s is the goal.
29. Recommendation 1
- Continue progress on back ends:
  - Make management easier (AUTOMATE IT!!!).
  - Measure.
  - Compare best practices.
  - Continue to look for better algorithms.
- Live in fear:
  - We are at 10,000-node servers.
  - We are headed for 1,000,000-node servers.
30. Recommendation 2
- The current security approach is unworkable:
  - Anonymous clients.
  - The firewall is clueless.
  - Incredible complexity.
  - We can't win this game!
- So change the rules (redefine the problem):
  - No anonymity.
  - A unified authentication/authorization model.
  - Single-function devices (with simple interfaces).
  - Only one kind of interface (UDDI/WSDL/SOAP/...); one way to read that is sketched below.
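A hypothetical sketch, not from the talk, of what "single-function devices behind one kind of interface" could look like: every service exposes the same minimal, authenticated entry point, so there is exactly one surface to secure and audit. All names below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Request:
    principal: str    # authenticated identity: no anonymous clients
    payload: bytes

class SingleFunctionDevice:
    """One device, one function, one uniform interface."""
    def invoke(self, req: Request) -> bytes:
        raise NotImplementedError

class MailStore(SingleFunctionDevice):
    def invoke(self, req: Request) -> bytes:
        # The device's single function: store the payload for the principal.
        return f"stored {len(req.payload)} bytes for {req.principal}".encode()

device: SingleFunctionDevice = MailStore()
print(device.invoke(Request(principal="alice", payload=b"hello")))
```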
31. References
- Adams, E. (1984). "Optimizing Preventive Service of Software Products." IBM Journal of Research and Development 28(1): 2-14.
- Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
- Garcia-Molina, H. and C. A. Polyzois. (1990). "Issues in Disaster Recovery." 35th IEEE Compcon 90: 573-577.
- Gray, J. (1986). "Why Do Computers Stop and What Can We Do About It." 5th Symposium on Reliability in Distributed Software and Database Systems: 3-12.
- Gray, J. (1990). "A Census of Tandem System Availability Between 1985 and 1990." IEEE Transactions on Reliability 39(4): 409-418.
- Gray, J. N. and A. Reuter. (1993). Transaction Processing: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann.
- Lampson, B. W. (1981). "Atomic Transactions." In Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
- Laprie, J. C. (1985). "Dependable Computing and Fault Tolerance: Concepts and Terminology." 15th FTCS: 2-11.
- Long, D. D., J. L. Carroll, and C. J. Park. (1991). "A Study of the Reliability of Internet Sites." Proc. 10th Symposium on Reliable Distributed Systems, Pisa, September 1991: 177-186.
- Long, D., A. Muir, and R. Golding. (1995). "A Longitudinal Study of Internet Host Reliability." Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, IEEE, September 1995: 2-9.
- http://www.netcraft.com/ -- They have even better for-fee data as well, but the for-free data is really excellent.
- http://www2.ebay.com/aw/announce.shtml#top -- eBay is an excellent benchmark of best Internet practices.
- http://www-iepm.slac.stanford.edu/pinger/ -- Network traffic/quality report; dated, but the others have died off!