Title: Monitoring, Diagnosing, and Repairing
1Monitoring, Diagnosing, and Repairing
- Eric Anderson
- U.C. Berkeley
2Overview
- What is System Administration?
- What is the problem?
- Goals of Dissertation Research
- Goals of System Administration
- Monitoring, diagnosing, and repairing
- Dissertation Timeline
- Conclusion
3What is the problem?
- Problems occur in systems, and result in loss of productivity
- Server failures → denial of service
- System overload → lower productivity
- Cost is too high
- Cost of ownership estimated at $5,000-15,000/year/machine
- Median salary ($50k) / (median machines/admin) → $700
- Our goal: Reduce cost by
- Repairing problems faster (possibly automatically)
- Handling more problems
4Goals of Dissertation Research
- Describe field of System Administration
- Monitoring, Diagnosing, and Repairing
- Approach: Synthesize solutions from other fields of research
- 1) Detect previously ignored problems
- 2) Automatic repair of some problems
- 3) Reduce number of administrators needed
- 4) Support users' understanding of system
- Apply here and distribute software
- Thesis: Through our approach, we can achieve goals 1-4.
5Goals of System Administration
- Goal: Support cost-effective use of the computer environment
- More specifically (some non-technical):
- Environment: uniform, customizable, high performance, and available
- Faults and errors: recovery from benign errors, protection from malicious attacks
- Users: training, accounting, planning, legal
6Monitoring, Diagnosing, and Repairing (MDR)
- Introductory examples
- Fundamental requirements
- Environmental constraints
- Previous work
- Six key innovations
- Architecture
- Details on innovations
- Evaluation methodology
7MDR Examples Intro
- Four examples
- 1) Broken component
- 2) Resource overload: transient
- 3) Resource contention: user program
- 4) Resource exhaustion: long term
- Previous Solutions
- Pay someone to watch
- Ignore or wait for someone to complain
- Specialized scripts (not general → vast repeated work)
8MDR Example 1
- Web server has crashed/hung
- Gather information: process existence, service uptime, restart times
- Analyze data: process not responding, and hasn't been recently restarted.
- Automatic repair: restart daemon.
- Notify administrator: had to restart daemon.
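The gather/analyze/repair/notify steps in this example can be sketched as a small decision function. This is a minimal illustration, not the system's actual restarter: the `DaemonStatus` fields, the `min_restart_gap` threshold, and the "escalate" outcome (hand the problem to the administrator when the daemon was restarted recently) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class DaemonStatus:
    """Gathered facts about one daemon (hypothetical field names)."""
    process_exists: bool          # is the daemon process present?
    responding: bool              # did a service probe get an answer?
    seconds_since_restart: float  # time since the last restart


def choose_action(status: DaemonStatus, min_restart_gap: float = 300.0) -> str:
    """Return 'ok', 'restart', or 'escalate' for one daemon."""
    if status.process_exists and status.responding:
        return "ok"
    # Not responding: restart only if it has not been restarted recently,
    # so a daemon that keeps dying is not restarted in a tight loop.
    if status.seconds_since_restart >= min_restart_gap:
        return "restart"
    return "escalate"
```

A caller would restart the daemon on "restart" and then notify the administrator that a restart was needed; on "escalate" it would only notify.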
9MDR Example 2
- The NOW is slow.
- Gather data: load, process info, CPU info
- Analyze data: bounds on expected values
- Notify administrator: fileserver overloaded.
- Visualize data: nfsds are overloaded.
- Repair: admin moves data, adds disks, or starts more nfsds
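The "bounds on expected values" analysis can be illustrated as follows. This is a sketch under assumptions: the bounds are taken as mean ± k population standard deviations of past samples, and the statistic names (`load`, `nfsd_busy`) and the factor `k` are hypothetical, not taken from the talk.

```python
import statistics


def learn_bounds(history, k=3.0):
    """Derive (low, high) bounds from past samples as mean +/- k deviations."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return (mean - k * spread, mean + k * spread)


def out_of_bounds(sample, bounds):
    """Return the sorted names of statistics whose value is outside its bounds."""
    flagged = []
    for name, value in sample.items():
        if name in bounds:
            low, high = bounds[name]
            if not (low <= value <= high):
                flagged.append(name)
    return sorted(flagged)


# Hypothetical usage: one learned bound, one fixed bound.
bounds = {
    "load": learn_bounds([1.0, 1.2, 0.8, 1.0]),
    "nfsd_busy": (0.0, 0.9),
}
```

Statistics reported by `out_of_bounds` are the candidates for notifying the administrator.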
10MDR Example 3
- User running program
- Gather: user statistics, CPU, disk
- Visualize: spending too much time waiting on remote accesses
- (User fixes program; gathering and visualization repeated)
- Analyze: some nodes have less throughput
- Visualize: those have other jobs running on them
- Repair: user is benchmarking, so kills all extraneous processes
11MDR Example 4
- Web server load increasing beyond capacity
- Gather: CPU, request rate, reply latency
- Analyze: Burst lengths getting longer, latency increasing
- Visualize: Graph of burst lengths and CPU usage over time
- Repair: Order more machines, install load balancer
12MDR Fundamental Requirements
- Gathering
- Flexible data gathering, self-describing storage
- Analyzing
- Calculate statistical measures, identify relevant statistics.
- Notifying
- Flexible, infrequent messages to administrators or users
- Visualizing
- Maximize information/pixel, support multiple interfaces
- Repairing
- Automate simple repairs, support group operations
13MDR Environmental Constraints
- Change is inherent
- Lack of Web/Mbone 5 years ago, now most/many have these.
- Problems on many time-scales
- Second-Minute transients vs. Week-Month capacity problems
- Must operate under very adverse conditions
- Often used when system is broken
- Would like at least post-mortem analysis
- Need to handle hundreds to thousands of nodes
- Scalability: All sites are getting larger, possibly wide area
- Our system has 200 (NOW) to 2000 (Soda) nodes
14MDR Previous Systems
- Many previous systems; I've looked at about 16.
- Not comprehensive, not extensible.
- Look at a few that did a nice job of a piece
- [Fink97] Run test, notify display engine
- Easy to add tests
- Selectivity of notification good
- Tests are just programs (redo gathering)
- Central, non-fault tolerant solution
- Many hard coded constants
15MDR Previous Systems, cont.
- [Hard92] buzzerd: Pager notification system
- Flexible rules for notification
- External interface for adding notify requests
- Simplistic gathering
- Poor fault tolerance
- [Pier96] Igor: group fixes
- Flexible operations
- Nice reporting of success/failure
- Weak security, runs as root
- No delegation of responsibility
16MDR Six Key Innovations (1-3)
- Replicated, semi-hierarchical, data storage nodes
- Rendezvous point for programs
- Handles scaling and fault-tolerance
- Self describing structures
- Functions (visualize, summarize) and data go in database (OO)
- DB has machine and human readable descriptions of data
- End to end notification
- Detect problems in MDR system
- Guarantee important messages get to users
17MDR Six Key Innovations (4-6)
- Aggregation and High Resolution Color Displays
- Reduce information to manageable amounts
- Maximize information per unit area
- Partially self-configuring
- Learn averages, deviations, burst sizes
- Learn which values are relevant to problems
- Secure, user-specified group repairs
- Don't enable malicious attacks
- Automate repairs of many machines
18MDR Architecture
[Figure: architecture diagram, not reproduced. Surviving label: Users and Administrators]
19MDR-Arch Derivations
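[Figure: architecture diagram annotated with the field each component derives from, not reproduced. Surviving component labels: Daemon Restarter; Tolerance, Relevance Learner. Source fields named: Databases and Software Engineering, Expert Systems, User Interfaces, Artificial Intelligence.]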
20Key Semi-Hier. DBs.
- Fault tolerance
- Scalability
- Caches don't need to commit to disk; authoritative copy elsewhere.
- Batching updates over wide area links.
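The last two points can be sketched as a cache node that keeps data in memory only and forwards updates upstream in batches. This is an illustrative sketch, not the system's storage node: the class name, the `send_upstream` callback, and the `batch_size` parameter are hypothetical.

```python
class CacheNode:
    """In-memory cache that batches updates to an upstream node.

    Nothing is committed to disk here: the authoritative copy is assumed
    to live at the upstream node.
    """

    def __init__(self, send_upstream, batch_size=100):
        self.latest = {}                    # most recent value per key
        self.pending = []                   # updates not yet forwarded
        self.send_upstream = send_upstream  # called with a list of updates
        self.batch_size = batch_size

    def update(self, key, value):
        """Record an update locally and forward a batch once it is full."""
        self.latest[key] = value
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Forward all pending updates in one message, if there are any."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.send_upstream(batch)
```

Sending one message per batch rather than per update is what saves traffic on a wide area link; a real node would also flush on a timer so that a slow trickle of updates is not delayed indefinitely.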
21Key Self-Describing
- De-couple data gathering, data storage, and data use
- Self-Describing for Humans
- Descriptions of meanings of values stored with tables
- Description of methods of gathering stored with tables
- Column names help with self
- Self-Describing for Computers
- Functions for visualizing or summarizing data
- Indication of resource selection from resource statistics
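One way to picture a self-describing table is a record that carries its human-readable descriptions and its summary function alongside the data. The sketch below is only an illustration: the class, its field names, and the example `cpu_load` table are hypothetical and are not the system's schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SelfDescribingTable:
    name: str
    columns: Dict[str, str]                   # column name -> meaning (for humans)
    gathered_by: str                          # how the values are gathered (for humans)
    summarize: Callable[[List[dict]], dict]   # summary function (for programs)
    rows: List[dict] = field(default_factory=list)

    def describe(self) -> str:
        """Render the stored descriptions as readable text."""
        lines = [f"table {self.name} (gathered by: {self.gathered_by})"]
        lines += [f"  {col}: {meaning}" for col, meaning in self.columns.items()]
        return "\n".join(lines)


def mean_load(rows):
    """Summarize a list of rows by their average 'load' value."""
    if not rows:
        return {}
    return {"mean_load": sum(r["load"] for r in rows) / len(rows)}


cpu = SelfDescribingTable(
    name="cpu_load",
    columns={"host": "machine name", "load": "1-minute load average"},
    gathered_by="periodic poll of each host",
    summarize=mean_load,
)
cpu.rows += [{"host": "a", "load": 1.0}, {"host": "b", "load": 3.0}]
```

Because the description and the summary function travel with the table, a display program can call `cpu.summarize(cpu.rows)` and `cpu.describe()` without knowing how the data was gathered.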
22Key End-to-End Notification
- Recall: System must operate under extreme conditions
- Humans must validate that system is still working
- Standalone display can indicate timestamps, mark out-of-date data
- Wireless machine could intermittently contact notification system
- Pager could be automatically paged every so often
- Problems should be propagated to end users.
- Flexible notification: connected systems, e-mail, pager.
- Limit over-notification
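Two of these ideas, marking out-of-date data and limiting over-notification, can be sketched with simple timestamp checks. The names `is_stale`, `Notifier`, `max_age`, and `min_interval` are hypothetical, and the one-message-per-interval policy is an assumption rather than the talk's stated rule.

```python
def is_stale(last_update, now, max_age):
    """True if data is older than max_age and should be marked out of date."""
    return now - last_update > max_age


class Notifier:
    """Suppress repeat notifications for the same problem within min_interval."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_sent = {}  # problem id -> time of the last notification sent

    def should_send(self, problem, now):
        last = self.last_sent.get(problem)
        if last is not None and now - last < self.min_interval:
            return False     # already notified recently
        self.last_sent[problem] = now
        return True
```

Times are plain numbers (for example seconds), so the same code works with a real clock or with simulated time in tests.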
23Key Aggregation HiRes
- System target has hundreds to thousands of nodes
- Aggregate by showing out of bounds, relevant values (via automatic tuning)
- Also want overview of system
- Aggregate across similar statistics: show value (fill) and dispersion (shade)
- Use color to highlight important values.
- Aggregate across values (machine utilization: CPU, disk, memory)
- Maximize data/pixel [Tufte]
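Aggregating one statistic across many nodes into a value (fill) and a dispersion (shade) might look like the sketch below. Using the mean for the value and the population standard deviation for the dispersion is an assumption; the slide does not say which measures the display uses.

```python
import statistics
from collections import defaultdict


def aggregate(readings):
    """Collapse per-node readings into one (fill, shade) pair per statistic.

    readings: iterable of (node, statistic, value) tuples.
    Returns {statistic: (mean value, population standard deviation)}.
    """
    by_stat = defaultdict(list)
    for _node, stat, value in readings:
        by_stat[stat].append(value)
    return {
        stat: (statistics.fmean(values), statistics.pstdev(values))
        for stat, values in by_stat.items()
    }


readings = [
    ("n1", "cpu", 0.2), ("n2", "cpu", 0.4), ("n3", "cpu", 0.6),
    ("n1", "disk", 0.5), ("n2", "disk", 0.5),
]
summary = aggregate(readings)
```

In this example `summary["disk"]` has zero dispersion because every node reports the same value, while `summary["cpu"]` has a nonzero shade that signals the nodes disagree.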
24Key Agg HiRes Snapshot
25Key Self-Configuring
- Single statistics
- Phase 1: Calculate averages, standard deviations, burst sizes
- Worked in other systems [Jaco88, Karn91]
- Identify relevant statistics
- Give system Boolean examples (variables out of bounds, and system working/not working); get function.
- Works for Boolean disjunctions in some cases
- With lots of irrelevant variables [Litt89]
- With random bad examples [Sloa89]
- In some cases, with malicious bad examples [Ande94]
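Learning a Boolean disjunction when most variables are irrelevant can be illustrated with a Winnow-style multiplicative-update learner. Winnow is a well-known algorithm for this setting; whether the proposed system uses it is an assumption here, and the example data (six variables, of which variables 0 and 2 indicate a broken system) is made up for illustration.

```python
import itertools


def predict(weights, threshold, x):
    """Predict 'system not working' if the active weights reach the threshold."""
    return sum(w for w, xi in zip(weights, x) if xi) >= threshold


def winnow_train(examples, n, alpha=2.0, passes=50):
    """Train on (x, label) pairs; x is a 0/1 tuple of n out-of-bounds flags."""
    weights = [1.0] * n
    threshold = float(n)
    for _ in range(passes):
        mistakes = 0
        for x, label in examples:
            if predict(weights, threshold, x) == label:
                continue
            mistakes += 1
            # Promote active weights on a missed failure, demote on a false alarm.
            factor = alpha if label else 1.0 / alpha
            weights = [w * factor if xi else w for w, xi in zip(weights, x)]
        if mistakes == 0:
            break  # a full pass with no mistakes: consistent with all examples
    return weights, threshold


# Hypothetical data: the system is broken exactly when variable 0 or 2 is out of bounds.
n = 6
examples = [(x, bool(x[0] or x[2])) for x in itertools.product([0, 1], repeat=n)]
weights, threshold = winnow_train(examples, n)
```

Only the weights of variables that are active during mistakes change, which is why learners of this kind cope well with many irrelevant variables.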
26Key Secure Remote Actions
- Security because of malicious attacks, benign errors
- Delegation to remove SA from the loop
- Independence from particular algorithms
- Building a library
- Program with principals (hosts, users), and properties (signed, sealed, verifiable)
- Use secure, run-time extensible languages
- Actions report through gathering system
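Programming with principals and properties, rather than with keys and algorithms, could look like the sketch below, where callers ask for a message "signed by" a principal and never handle a key. The `Keyring` class and its methods are hypothetical; HMAC-SHA256 from the Python standard library stands in for whatever algorithm a real library would choose, and as a shared-secret scheme it is only a placeholder.

```python
import hashlib
import hmac


class Keyring:
    """Maps principals (hosts, users) to keys so callers never touch keys."""

    def __init__(self):
        self._keys = {}

    def add_principal(self, name, key):
        self._keys[name] = key

    def signed(self, principal, message):
        """Return an envelope carrying the 'signed' property for a principal."""
        tag = hmac.new(self._keys[principal], message, hashlib.sha256).hexdigest()
        return {"principal": principal, "message": message, "tag": tag}

    def verify(self, envelope):
        """Check that the envelope was signed by the principal it names."""
        key = self._keys.get(envelope["principal"])
        if key is None:
            return False
        expected = hmac.new(key, envelope["message"], hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, envelope["tag"])
```

Because callers see only `signed` and `verify`, the algorithm behind them can be replaced without changing the programs that request repairs.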
27MDR Testing Methodology
- Fault injection
- Deliberately make the system slow
- Break hardware/software components
- Feature comparison
- Paper comparison with other systems
- Usage in practice
- Experience important to show system works
- We have need of administrative tools
- Testimonials
- Experience at other sites lends credibility
28MDR Demo
- Hierarchical structure working (1 level right now)
- Alternative Interface
- Fault Injection
- Need for Aggregation
- Crufty right now
- Demo
29Timeline Key Pieces
- 1) (DBs) Replicated, semi-hierarchical, data storage nodes
- 2) (SDS) Self describing structures
- 3) (Vis) Aggregation and High Resolution Color Displays
- 4) (E2EN) End to end notification
- 5) (ReS) Automatic Restart
- 6) (Cfg) Partially self-configuring
- 7) (Rep) Secure, user-specified group repairs
30Timeline
Deadlines
LISA 6/97
USENIX 12/97
OSDI 3/98
LISA 6/98
Graduation 12/98
SOSP 3/99
Mar, 1999
31Conclusion
- Description of field shows breadth
- Monitoring, diagnosing, and repairing shows depth
- Examples show importance of problem
- Fundamental goals and environmental constraints show understanding of problem
- Key innovations show differences from previous systems.
- Architecture and initial prototype show approach to problem
- Testing methods show ways to validate solution.
- Timeline shows plan and milestones to graduation
32Old Slides
33Solutions
- Managing stable storage
- Supporting users
- Simplifying security
- Monitoring, diagnosing, and repairing
34Managing Stable Storage
- Consistency vs. availability
- Fault tolerance
- Scalability
- Recoverability
- Customization
35Supporting Users
- Automated help desk
- Searchable collection of questions
- Easy method for addition
- Remote device access
- Site-wide training
36Goals Environment
- Uniform
- Supports user mobility by eliminating arbitrary changes
- Increases effectiveness by avoiding need for users to learn multiple interfaces
- Customizable
- Handles special systems and special needs: firewalls, servers
- Obviously reduces uniformity
37Goals Environment, cont.
- High Performance
- Increases effectiveness of users (HCI/psych)
- Limited by cost-effectiveness
- Available
- Effectiveness is 0 if system isn't working
- Balanced against expense
38Goals Faults Errors
- Benign errors
- Accidentally deleted files
- Unnoticed runaway processes
- Malicious attacks
- TCP SYN attack
- Sendmail bugs
- Data stealing
- False data injection
39Goals Users
- Training
- Troubleshooting: one-on-one training
- Larger sessions: classes
- Accounting
- Supports management, helps billing
- Capacity Planning
- Expanding systems takes time
- Legal
- Sensitive information needs protection
40Simplifying Security
- USENIX talk says "If cryptography is so great, why isn't it used more?"
- SAs worry about security to protect data.
- Goal: Ease development of secure applications
- Write programs using principals and properties rather than keys and algorithms
- Unify various forms of available cryptography (public key, secret-key, PGP, Kerberos)
- My use: protected, transferable rights to allow various actions
- Modify system configurations (add filesystems, printers)
- Kill/restart processes (runaway, after configurations modified)
- Access data (private logs, for backups, etc.)
41Conclusion
- System administration as area of research
- Description of field
- Areas for future research
- Managing stable storage
- Supporting users
- Initial investigation of research area
- Monitoring, diagnosing, and repairing
- Broad, draws from many fields