Monitoring, Diagnosing, and Repairing - PowerPoint PPT Presentation

1
Monitoring, Diagnosing, and Repairing
  • Eric Anderson
  • U.C. Berkeley

2
Overview
  • What is System Administration?
  • What is the problem?
  • Goals of Dissertation Research
  • Goals of System Administration
  • Monitoring, diagnosing, and repairing
  • Dissertation Timeline
  • Conclusion

3
What is the problem?
  • Problems occur in systems, and result in loss of
    productivity
  • Server failures → denial of service
  • System overload → lower productivity
  • Cost is too high
  • Cost of ownership estimated at $5,000-15,000/year
    /machine
  • Median salary ($50k) / (median machines/admin)
    ≈ $700
  • Our goal: reduce cost by
  • Repairing problems faster (possibly
    automatically)
  • Handling more problems
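
The per-machine estimate on this slide is a simple division. A minimal sketch of that arithmetic follows, assuming the figures are US dollars per year and that the result of roughly 700 is the administrator salary cost per machine. The function names are hypothetical, and the machines-per-administrator ratio is derived here rather than stated on the slide.

```python
def salary_cost_per_machine(median_salary: float, machines_per_admin: float) -> float:
    """Administrator salary spread over the machines one administrator manages."""
    return median_salary / machines_per_admin


def implied_machines_per_admin(median_salary: float, cost_per_machine: float) -> float:
    """Invert the formula to recover the machines-per-administrator ratio."""
    return median_salary / cost_per_machine


# Assumed inputs: 50k median salary, about 700 per machine per year.
ratio = implied_machines_per_admin(50_000, 700)
print(round(ratio))  # 71
```

Under these assumptions the slide's numbers imply roughly 71 machines per administrator.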

4
Goals of Dissertation Research
  • Describe field of System Administration
  • Monitoring, Diagnosing, and Repairing
  • Approach: synthesize solutions from other fields
    of research
  • 1) Detect previously ignored problems
  • 2) Automatic repair of some problems
  • 3) Reduce number of administrators needed
  • 4) Support users' understanding of system
  • Apply here: distribute software
  • Thesis: through our approach, we can achieve
    goals 1-4.

5
Goals of System Administration
  • Goal: support cost-effective use of the computer
    environment
  • More specifically (some non-technical):
  • Environment: uniform, customizable, high
    performance, and available
  • Faults & errors: recovery from benign errors,
    protection from malicious attacks
  • Users: training, accounting, planning, legal

6
Monitoring, Diagnosing, and Repairing (MDR)
  • Introductory examples
  • Fundamental requirements
  • Environmental constraints
  • Previous work
  • Six key innovations
  • Architecture
  • Details on innovations
  • Evaluation methodology

7
MDR Examples: Intro
  • Four examples
  • 1) Broken component
  • 2) Resource overload: transient
  • 3) Resource contention: user program
  • 4) Resource exhaustion: long term
  • Previous solutions
  • Pay someone to watch
  • Ignore or wait for someone to complain
  • Specialized scripts (not general → vast repeated
    work)

8
MDR Example 1
  • Web server has crashed/hung
  • Gather information: process existence, service
    uptime, restart times
  • Analyze data: process not responding, and hasn't
    been recently restarted.
  • Automatic repair: restart daemon.
  • Notify administrator: had to restart daemon.
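
The gather, analyze, repair, notify loop in this example can be sketched as a small decision function. This is an illustration, not the prototype's code: `ServiceState`, `decide_repair`, `handle`, and the 300-second window are hypothetical, and the "escalate" branch (what to do when the daemon was restarted recently) is an assumption the slide does not spell out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ServiceState:
    responding: bool     # gathered: does the service answer a probe?
    last_restart: float  # gathered: time of last restart, epoch seconds


def decide_repair(state: ServiceState, now: float,
                  min_restart_interval: float = 300.0) -> str:
    """Analyze gathered data and pick an action."""
    if state.responding:
        return "ok"
    if now - state.last_restart < min_restart_interval:
        # Assumption: a recent restart did not help, so hand off to a human.
        return "escalate"
    return "restart"


def handle(state: ServiceState, now: float,
           restart: Callable[[], None],
           notify: Callable[[str], None]) -> str:
    """Carry out the chosen repair and tell the administrator what happened."""
    action = decide_repair(state, now)
    if action == "restart":
        restart()
        notify("had to restart daemon")
    elif action == "escalate":
        notify("daemon not responding after recent restart")
    return action
```

Passing `restart` and `notify` as callbacks keeps the decision logic testable without touching a real daemon or pager.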

9
MDR Example 2
  • The NOW is slow.
  • Gather data: load, process info, CPU info
  • Analyze data: bounds on expected values
  • Notify administrator: fileserver overloaded.
  • Visualize data: nfsds are overloaded.
  • Repair: admin moves data, adds disks, or starts
    more nfsds
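
The "bounds on expected values" analysis step can be illustrated with a simple range check. This is a minimal sketch; the statistic names (`load`, `nfsd_queue`) and the bound values are made up for the example and do not come from the slides.

```python
from typing import Dict, Tuple

Bounds = Dict[str, Tuple[float, float]]


def out_of_bounds(samples: Dict[str, float], bounds: Bounds) -> Dict[str, float]:
    """Return the statistics whose current value falls outside its expected range."""
    flagged = {}
    for name, value in samples.items():
        if name not in bounds:
            continue  # no expectation recorded for this statistic
        low, high = bounds[name]
        if not (low <= value <= high):
            flagged[name] = value
    return flagged


# Hypothetical expectations and one gathered sample from a file server.
expected = {"load": (0.0, 4.0), "nfsd_queue": (0.0, 10.0)}
sample = {"load": 1.2, "nfsd_queue": 37.0, "uptime_days": 12.0}
print(out_of_bounds(sample, expected))  # {'nfsd_queue': 37.0}
```

Only the flagged statistics would need to reach the administrator, which keeps notifications short.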

10
MDR Example 3
  • User running program
  • Gather: user statistics, CPU, disk
  • Visualize: spending too much time waiting on
    remote accesses
  • (User fixes program; gathering and visualization
    repeated)
  • Analyze: some nodes have less throughput
  • Visualize: those have other jobs running on them
  • Repair: user is benchmarking, so kills all
    extraneous processes

11
MDR Example 4
  • Web server increasing beyond capacity
  • Gather: CPU, request rate, reply latency
  • Analyze: burst lengths getting longer, latency
    increasing
  • Visualize: graph of burst lengths & CPU usage
    over time
  • Repair: order more machines, install load balancer
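
One way to make "burst lengths getting longer" concrete is to measure runs of consecutive samples above a threshold and compare early bursts with late ones. The sketch below assumes evenly spaced request-rate samples; the threshold and the half-versus-half comparison are illustrative choices, not the method from the slides.

```python
from typing import List, Sequence


def burst_lengths(rates: Sequence[float], threshold: float) -> List[int]:
    """Lengths of consecutive runs where the request rate exceeds the threshold."""
    lengths = []
    run = 0
    for rate in rates:
        if rate > threshold:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)  # a burst still in progress at the end
    return lengths


def bursts_growing(lengths: Sequence[int]) -> bool:
    """True if the later half of the bursts is longer on average than the earlier half."""
    if len(lengths) < 2:
        return False
    mid = len(lengths) // 2
    early, late = lengths[:mid], lengths[mid:]
    return sum(late) / len(late) > sum(early) / len(early)


rates = [1, 9, 9, 1, 9, 9, 9, 1, 9, 9, 9, 9]
print(burst_lengths(rates, threshold=5))  # [2, 3, 4]
```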

12
MDR Fundamental Requirements
  • Gathering
  • Flexible data gathering, self-describing storage
  • Analyzing
  • Calculate statistical measures, identify relevant
    statistics.
  • Notifying
  • Flexible, infrequent messages to administrators or
    users
  • Visualizing
  • Maximize information/pixel, support multiple
    interfaces
  • Repairing
  • Automate simple repairs, support group operations

13
MDR Environmental Constraints
  • Change is inherent
  • Lack of Web/Mbone 5 years ago; now most/many have
    these.
  • Problems on many time-scales
  • Second-minute transients vs. week-month capacity
    problems
  • Must operate under very adverse conditions
  • Often used when system is broken
  • Would like at least post-mortem analysis
  • Need to handle hundreds to thousands of nodes
  • Scalability: all sites are getting larger,
    possibly wide area
  • Our system has 200 (NOW) to 2000 (Soda) nodes

14
MDR Previous Systems
  • Many previous systems; I've looked at about 16.
  • Not comprehensive, not extensible.
  • Look at a few that did a nice job of a piece
  • Fink97: run test, notify display engine
  • Easy to add tests
  • Selectivity of notification good
  • Tests are just programs (redo gathering)
  • Central, non-fault tolerant solution
  • Many hard coded constants

15
MDR Previous Systems, cont.
  • Hard92 buzzerd: pager notification system
  • Flexible rules for notification
  • External interface for adding notify requests
  • Simplistic gathering
  • Poor fault tolerance
  • Pier96 Igor: group fixes
  • Flexible operations
  • Nice reporting of success/failure
  • Weak security, runs as root
  • No delegation of responsibility

16
MDR Six Key Innovations (1-3)
  • Replicated, semi-hierarchical, data storage nodes
  • Rendezvous point for programs
  • Handles scaling and fault-tolerance
  • Self describing structures
  • Functions (visualize, summarize) & data go in
    database (OO)
  • DB has machine and human readable descriptions of
    data
  • End to end notification
  • Detect problems in MDR system
  • Guarantee important messages get to users

17
MDR Six Key Innovations (4-6)
  • Aggregation and High Resolution Color Displays
  • Reduce information to manageable amounts
  • Maximize information per unit area
  • Partially self-configuring
  • Learn averages, deviations, burst sizes
  • Learn which values are relevant to problems
  • Secure, user-specified group repairs
  • Don't enable malicious attacks
  • Automate repairs of many machines

18
MDR Architecture
[Figure: architecture diagram; surviving label: Users and Administrators]

19
MDR-Arch Derivations
[Figure: architecture diagram annotated with the fields its parts derive from. Surviving labels: from Databases and Software Engineering; Daemon Restarter; from Expert Systems (twice); Tolerance, Relevance Learner; from User Interfaces; from Artificial Intelligence]

20
Key: Semi-Hier. DBs
  • Fault tolerance
  • Scalability
  • Caches don't need to commit to disk;
    authoritative copy is elsewhere.
  • Batching updates over wide area links.
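
The caching and batching points can be illustrated with an in-memory cache node that forwards updates to its parent in batches. This is a sketch under assumptions: `CacheNode`, `parent_send`, and the count-based flush policy are hypothetical, and a real node would likely also flush on a timer.

```python
from typing import Callable, Dict, List, Tuple

Update = Tuple[str, float]


class CacheNode:
    """In-memory cache; the authoritative copy lives at the parent node."""

    def __init__(self, parent_send: Callable[[List[Update]], None],
                 batch_size: int = 3) -> None:
        self.parent_send = parent_send  # e.g. one message over a wide-area link
        self.batch_size = batch_size
        self.latest: Dict[str, float] = {}  # never written to local disk
        self.pending: List[Update] = []

    def update(self, key: str, value: float) -> None:
        """Record a new value locally and queue it for the parent."""
        self.latest[key] = value
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send all queued updates to the parent in a single batch."""
        if self.pending:
            self.parent_send(list(self.pending))
            self.pending.clear()


sent: List[List[Update]] = []
node = CacheNode(sent.append, batch_size=2)
node.update("n1.load", 0.5)
node.update("n2.load", 0.7)  # second update triggers one batched send
node.update("n1.load", 0.9)  # stays queued until the next flush
```

Because the cache holds no authoritative state, losing a cache node loses at most the unsent batch.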

21
Key: Self-Describing
  • De-couple data gathering, data storage, and data
    use
  • Self-Describing for Humans
  • Descriptions of meanings of values stored with
    tables
  • Description of methods of gathering stored with
    tables
  • Column names help with self
  • Self-Describing for Computers
  • Functions for visualizing or summarizing data
  • Indication of resource selection from resource
    statistics
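
A self-describing table might carry its human-readable descriptions and its machine-usable functions alongside the rows. The sketch below is one possible shape, not the prototype's schema; every class, field, and column name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class Column:
    name: str
    meaning: str  # for humans: what the value means
    units: str


@dataclass
class Table:
    name: str
    gathered_by: str  # for humans: how the data is gathered
    columns: List[Column]
    summarize: Callable[[List[Row]], Row]  # for computers: stored with the data
    rows: List[Row] = field(default_factory=list)


def describe(table: Table) -> str:
    """Render the human-readable description stored with the table."""
    lines = [f"{table.name} (gathered by: {table.gathered_by})"]
    lines += [f"  {c.name} [{c.units}]: {c.meaning}" for c in table.columns]
    return "\n".join(lines)


def mean_of_columns(rows: List[Row]) -> Row:
    """A simple summarizer: per-column mean."""
    if not rows:
        return {}
    return {key: sum(r[key] for r in rows) / len(rows) for key in rows[0]}


cpu = Table(
    name="cpu",
    gathered_by="hypothetical per-node sampler, every 10 s",
    columns=[Column("user_pct", "CPU time spent in user mode", "%")],
    summarize=mean_of_columns,
    rows=[{"user_pct": 20.0}, {"user_pct": 40.0}],
)
print(describe(cpu))
print(cpu.summarize(cpu.rows))  # {'user_pct': 30.0}
```

A viewer written against `Table` can display or summarize data it has never seen before, which is the decoupling the slide describes.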

22
Key: End-to-End Notification
  • Recall: system must operate under extreme
    conditions
  • Humans must validate that system is still working
  • Standalone display can indicate timestamps, mark
    out-of-date data
  • Wireless machine could intermittently contact
    notification system
  • Pager could be automatically paged every so often
  • Problems should be propagated to end users.
  • Flexible notification: connected systems,
    e-mail, pager.
  • Limit over-notification
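
Two of these ideas, marking out-of-date data and limiting over-notification, can be sketched briefly. The names `is_stale` and `RateLimitedNotifier`, and the per-key minimum interval policy, are assumptions for illustration only.

```python
from typing import Callable, Dict


def is_stale(last_update: float, now: float, max_age: float) -> bool:
    """A display can use this to mark data whose timestamp is too old."""
    return now - last_update > max_age


class RateLimitedNotifier:
    """Suppress repeats of the same problem within a minimum interval."""

    def __init__(self, send: Callable[[str], None], min_interval: float) -> None:
        self._send = send
        self._min_interval = min_interval
        self._last_sent: Dict[str, float] = {}

    def notify(self, key: str, message: str, now: float) -> bool:
        """Send unless this key was notified recently; return whether it was sent."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self._min_interval:
            return False
        self._last_sent[key] = now
        self._send(message)
        return True


outbox = []
pager = RateLimitedNotifier(outbox.append, min_interval=600.0)
pager.notify("fileserver", "fileserver overloaded", now=0.0)    # sent
pager.notify("fileserver", "fileserver overloaded", now=120.0)  # suppressed
pager.notify("fileserver", "fileserver overloaded", now=900.0)  # sent again
```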

23
Key: Aggregation & HiRes
  • System target has hundreds to thousands of nodes
  • Aggregate by showing out-of-bounds, relevant
    values (via automatic tuning)
  • Also want overview of system
  • Aggregate across similar statistics: show value
    (fill) & dispersion (shade)
  • Use color to highlight important values.
  • Aggregate across values (machine utilization:
    CPU, disk, memory)
  • Maximize data/pixel [Tufte]
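
Showing value as fill and dispersion as shade suggests reducing a group of similar statistics to two numbers per display cell. A minimal sketch follows, assuming the mean serves as the value and the population standard deviation as the dispersion; the slides do not say which measures the prototype uses.

```python
import statistics
from typing import Sequence, Tuple


def aggregate(values: Sequence[float]) -> Tuple[float, float]:
    """Reduce many readings of one statistic to (value, dispersion)."""
    return statistics.mean(values), statistics.pstdev(values)


def to_glyph(values: Sequence[float], max_value: float,
             max_dispersion: float) -> Tuple[float, float]:
    """Map the aggregate to a fill fraction and a shade fraction, both in [0, 1]."""
    value, dispersion = aggregate(values)
    fill = min(value / max_value, 1.0)
    shade = min(dispersion / max_dispersion, 1.0)
    return fill, shade


# CPU utilization (%) across four nodes that all behave alike:
print(to_glyph([50, 50, 50, 50], max_value=100, max_dispersion=50))  # (0.5, 0.0)
```

A half-filled, unshaded cell then reads as "moderately loaded and uniform", while a shaded cell signals that individual nodes differ and deserve a closer look.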

24
Key: Agg & HiRes Snapshot

25
Key: Self-Configuring
  • Single statistics
  • Phase 1: calculate averages, standard deviations,
    burst sizes
  • Worked in other systems [Jaco88, Karn91]
  • Identify relevant statistics
  • Give system Boolean examples (variables out of
    bounds, and system working/not working); get
    function.
  • Works for Boolean disjunctions in some cases
  • With lots of irrelevant variables [Litt89]
  • With random bad examples [Sloa89]
  • In some cases, with malicious bad examples
    [Ande94]
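
Learning which out-of-bounds variables predict a broken system is a Boolean disjunction learning problem. One well-known algorithm for that setting when most variables are irrelevant is Littlestone's Winnow; whether it is the exact method behind the Litt89 citation is an assumption. The sketch below implements the Winnow1 variant on noise-free examples only, so it does not cover the random or malicious bad examples the slide mentions.

```python
from itertools import product
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[int], bool]


class Winnow1:
    """Mistake-driven learner for monotone disjunctions over n Boolean inputs."""

    def __init__(self, n: int) -> None:
        self.weights = [1.0] * n
        self.theta = float(n)  # prediction threshold

    def predict(self, x: Sequence[int]) -> bool:
        return sum(w for w, xi in zip(self.weights, x) if xi) >= self.theta

    def update(self, x: Sequence[int], label: bool) -> bool:
        """Learn from one example; return True if the prediction was wrong."""
        if self.predict(x) == label:
            return False
        for i, xi in enumerate(x):
            if xi:
                # Missed a failure: promote active inputs. False alarm: eliminate them.
                self.weights[i] = self.weights[i] * 2.0 if label else 0.0
        return True


def train(model: Winnow1, examples: List[Example], max_passes: int = 50) -> bool:
    """Cycle through the examples until a full pass makes no mistakes."""
    for _ in range(max_passes):
        if sum(model.update(x, y) for x, y in examples) == 0:
            return True
    return False


# x[i] = 1 means statistic i is out of bounds; label True means "system not working".
# Hypothetical ground truth: the system breaks when statistic 0 or statistic 3 is off.
examples = [(x, bool(x[0] or x[3])) for x in product((0, 1), repeat=6)]
model = Winnow1(6)
converged = train(model, examples)
```

After training, inputs whose weight has dropped to zero have been ruled out as irrelevant, which is the "learn which values are relevant" step.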

26
Key: Secure Remote Actions
  • Security: because of malicious attacks, benign
    errors
  • Delegation: to remove SA from the loop
  • Independence from particular algorithms
  • Building a library
  • Program with principals (hosts, users), and
    properties (signed, sealed, verifiable)
  • Use secure, run-time extensible languages
  • Actions report through gathering system
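
Programming with principals and properties, rather than with keys and algorithms directly, could look like the sketch below. It covers only the "signed" and "verifiable" properties, and it uses HMAC-SHA256 from the Python standard library purely as a stand-in: the slides call for independence from particular algorithms and do not name one. `Principal`, `Signed`, `sign`, and `verify` are hypothetical names.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    name: str   # a host or a user
    key: bytes  # secret held by the principal (stand-in for any key type)


@dataclass(frozen=True)
class Signed:
    signer: str
    payload: bytes
    tag: bytes


def _tag(principal: Principal, payload: bytes) -> bytes:
    # The algorithm is isolated here so callers never mention it.
    return hmac.new(principal.key, payload, hashlib.sha256).digest()


def sign(principal: Principal, payload: bytes) -> Signed:
    """Produce a payload with the 'signed' property."""
    return Signed(principal.name, payload, _tag(principal, payload))


def verify(principal: Principal, message: Signed) -> bool:
    """Check that the message was signed by this principal and is unmodified."""
    return (message.signer == principal.name
            and hmac.compare_digest(message.tag, _tag(principal, message.payload)))


admin = Principal("admin", b"example-secret-not-for-real-use")
request = sign(admin, b"restart httpd on node17")
forged = Signed("admin", b"restart httpd on node99", request.tag)
```

Swapping the stand-in for public-key signatures would change only `_tag` and `verify`, which is the point of programming against properties.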

27
MDR Testing Methodology
  • Fault injection
  • Deliberately make the system slow
  • Break hardware/software components
  • Feature comparison
  • Paper comparison with other systems
  • Usage in practice
  • Experience important to show system works
  • We have need of administrative tools
  • Testimonials
  • Experience at other sites lends credibility

28
MDR Demo
  • Hierarchical structure working (1 level right
    now)
  • Alternative Interface
  • Fault Injection
  • Need for Aggregation
  • Crufty right now
  • Demo

29
Timeline: Key Pieces
  • 1) (DBs) Replicated, semi-hierarchical, data
    storage nodes
  • 2) (SDS) Self describing structures
  • 3) (Vis) Aggregation and High Resolution Color
    Displays
  • 4) (E2EN) End to end notification
  • 5) (ReS) Automatic Restart
  • 6) (Cfg) Partially self-configuring
  • 7) (Rep) Secure, user-specified group repairs

30
Timeline
  • Deadlines
  • LISA 6/97
  • USENIX 12/97
  • OSDI 3/98
  • LISA 6/98
  • Graduation 12/98
  • SOSP 3/99
[Figure: timeline chart; surviving axis label: Mar, 1999]

31
Conclusion
  • Description of field shows breadth
  • Monitoring, diagnosing, and repairing shows depth
  • Examples show importance of problem
  • Fundamental goals & environmental constraints
    show understanding of problem
  • Key innovations show differences from previous
    systems.
  • Architecture and initial prototype show approach
    to problem
  • Testing methods show ways to validate solution.
  • Timeline shows plan & milestones to graduation

32
Old Slides
33
Solutions
  • Managing stable storage
  • Supporting users
  • Simplifying security
  • Monitoring, diagnosing, and repairing

34
Managing Stable Storage
  • Consistency vs. availability
  • Fault tolerance
  • Scalability
  • Recoverability
  • Customization

35
Supporting Users
  • Automated help desk
  • Searchable collection of questions
  • Easy method for addition
  • Remote device access
  • Site-wide training

36
Goals: Environment
  • Uniform
  • Supports user mobility by eliminating arbitrary
    changes
  • Increases effectiveness by avoiding need for
    users to learn multiple interfaces
  • Customizable
  • Handles special systems and special needs:
    firewalls, servers
  • Obviously reduces uniformity

37
Goals: Environment, cont.
  • High Performance
  • Increases effectiveness of users [HCI/psych]
  • Limited by cost-effectiveness
  • Available
  • Effectiveness is 0 if system isn't working
  • Balanced against expense

38
Goals: Faults & Errors
  • Benign errors
  • Accidentally deleted files
  • Unnoticed runaway processes
  • Malicious attacks
  • TCP SYN attack
  • Sendmail bugs
  • Data stealing
  • False data injection

39
Goals: Users
  • Training
  • Troubleshooting: one-on-one training
  • Larger sessions: classes
  • Accounting
  • Supports management, helps billing
  • Capacity Planning
  • Expanding systems takes time
  • Legal
  • Sensitive information needs protection

40
Simplifying Security
  • USENIX talk says: "If cryptography is so great,
    why isn't it used more?"
  • SAs worry about security to protect data.
  • Goal: ease development of secure applications
  • Write programs using principals & properties
    rather than keys and algorithms
  • Unify various forms of available cryptography
    (public key, secret-key, PGP, Kerberos)
  • My use: protected, transferable rights to allow
    various actions
  • Modify system configurations (add filesystems,
    printers)
  • Kill/restart processes (runaway, after
    configurations modified)
  • Access data (private logs, for backups, etc.)

41
Conclusion
  • System administration as area of research
  • Description of field
  • Areas for future research
  • Managing stable storage
  • Supporting users
  • Initial investigation of research area
  • Monitoring, diagnosing, and repairing
  • Broad, draws from many fields